aitgon / vtam

Variant sequences in double #21

Open meglecz opened 3 years ago

meglecz commented 3 years ago

The sqlite database occasionally contains the same sequence twice, once in upper case and once in lower case. These entries are identical apart from the case. The lower-case sequences do not have read counts.

I guess that this comes from running taxassign on sequences that are not yet in the sqlite db: every such sequence is added to the db in lower case, even if it is identical to a variant already stored in upper case. As a result, the same sequence can end up with different varIDs. I would prefer to eliminate this redundancy and use the same ID systematically for identical sequences.
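
To make the request concrete, here is a minimal sketch of one possible fix: normalize the case before looking up or inserting a sequence, so an existing variant is reused. The table name `Variant`, its columns `id` and `sequence`, and both helper functions are assumptions made for illustration. They are not taken from the vtam code.

```python
import sqlite3


def get_or_create_variant_id(conn, sequence):
    """Return the id for `sequence`, reusing an existing row if one matches.

    Assumes a hypothetical table Variant(id, sequence) in which existing
    variants are stored in upper case.
    """
    normalized = sequence.upper()  # compare and store in a single case
    row = conn.execute(
        "SELECT id FROM Variant WHERE sequence = ?", (normalized,)
    ).fetchone()
    if row is not None:
        return row[0]  # reuse the existing ID
    cur = conn.execute(
        "INSERT INTO Variant (sequence) VALUES (?)", (normalized,)
    )
    return cur.lastrowid


def find_case_duplicates(conn):
    """Map each upper-cased sequence to the ids of rows differing only by case."""
    groups = {}
    query = "SELECT id, sequence FROM Variant ORDER BY id"
    for variant_id, sequence in conn.execute(query):
        groups.setdefault(sequence.upper(), []).append(variant_id)
    # Keep only sequences that are stored more than once
    return {seq: ids for seq, ids in groups.items() if len(ids) > 1}


def make_demo_db():
    """Build an in-memory database that reproduces the reported situation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Variant (id INTEGER PRIMARY KEY, sequence TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO Variant (sequence) VALUES ('ACGT')")  # id 1
    conn.execute("INSERT INTO Variant (sequence) VALUES ('acgt')")  # id 2
    return conn


demo = make_demo_db()
print(find_case_duplicates(demo))              # {'ACGT': [1, 2]}
print(get_or_create_variant_id(demo, "acgt"))  # 1
```

Cleaning up rows that are already duplicated would also require repointing any records that reference the lower-case ID, which depends on the actual schema and is not shown here.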