gaurav / taxondna

Taxonomy-aware DNA sequence processing toolkit
http://www.ggvaidya.com/taxondna/
GNU General Public License v2.0

Produce a warning if the sum total of charsets is greater than that of the file #51

Open gaurav opened 14 years ago

gaurav commented 14 years ago

Specifically, the warning should say whether data was dropped (charsets only cover 1-100 of a 300 bp file) or duplicated (charsets 1-300 and 250-500 of a 500 bp file).

gaurav commented 14 years ago

We could put in a very simple check for consistency amongst the returned FromToPairs (FTPs): if the FTPs span 1-200, is every single position in that range accounted for? Combined with a report of the total number of base pairs in the file (perhaps stored in the SequenceList?), we might have everything we need to verify that all the data from the file has been successfully brought over.
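A minimal sketch of that check, for illustration only. It assumes each charset range is available as a 1-based, inclusive `{from, to}` pair; the `CharsetCoverage` class and `findGaps` method are hypothetical names, not existing TaxonDNA code, and plain `int[]` pairs stand in for FromToPairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of TaxonDNA: reports positions that no charset covers.
final class CharsetCoverage {

    private CharsetCoverage() {}

    /**
     * Returns the uncovered stretches as {from, to} pairs (1-based, inclusive).
     *
     * @param ranges     charset ranges as {from, to} pairs (stand-ins for FromToPairs)
     * @param fileLength total number of positions (base pairs) in the file
     */
    static List<int[]> findGaps(List<int[]> ranges, int fileLength) {
        // Sort a copy by start position so the input list is left untouched.
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));

        List<int[]> gaps = new ArrayList<>();
        int nextUncovered = 1; // lowest position not yet known to be covered
        for (int[] r : sorted) {
            if (nextUncovered > fileLength) {
                break; // the whole file is already covered
            }
            if (r[0] > nextUncovered) {
                // Positions between the previous coverage and this range were dropped.
                gaps.add(new int[] {nextUncovered, Math.min(r[0] - 1, fileLength)});
            }
            // Overlapping ranges are fine here; only the furthest end matters.
            nextUncovered = Math.max(nextUncovered, r[1] + 1);
        }
        if (nextUncovered <= fileLength) {
            // Nothing covers the tail of the file.
            gaps.add(new int[] {nextUncovered, fileLength});
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Charsets only cover 1-100 of a 300 bp file.
        for (int[] gap : findGaps(Arrays.asList(new int[] {1, 100}), 300)) {
            System.out.println("Uncovered: " + gap[0] + "-" + gap[1]); // prints "Uncovered: 101-300"
        }
    }
}
```

An empty result would mean every position from 1 to the file length is accounted for. Overlaps are deliberately ignored, so this sketch only answers the "was data dropped" half of the question.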

The best implementation would be some form of consistency checking in the individual Nexus/TNT file loaders. That'd be more complicated, but it would be neater ...

gaurav commented 14 years ago

Incidentally, we're already checking for overlapping datasets as of 1.7.6, so we only need to worry about dropped data.