Raised an error when two sequences with the same name are input (FASTA or FASTQ)

AndreaGuarracino commented 4 years ago

Hi @ekg, now the IDs have to be unique for the input sequences, both for FASTA and FASTQ files. I tried to be concise. In this way we should avoid other oddities that have happened to us recently with the SARS-CoV-2 pangenome plus its gene annotations.

ekg commented 4 years ago

I've merged this to avoid further problems like this, but this could become a problem for memory if the number of paths is very large. This can happen in the case of seqwish graphs induced from reads for instance. If the read names are a large fraction of the size of the entire read set, then we might get a very big std::unordered_set.

It'd be best to put this on disk somehow, or detect it during the construction of the seqindex. An easy way would be to query the compressed suffix array that holds the sequence names for each sequence name and detect duplicates there. If more than one occurrence of a path name is found in the CSA, then the same error should be thrown. This would not increase the memory usage at all, although there would be a runtime cost.

ekg commented 4 years ago

You would run the check with this function: https://github.com/ekg/seqwish/blob/master/src/seqindex.cpp#L214-L221, looking up each sequence in order after the construction of the csa has been done.

You would get the same effect as this PR by changing the assert there to provide the same descriptive error that you have.

ekg commented 4 years ago

The assert in question: https://github.com/ekg/seqwish/blob/master/src/seqindex.cpp#L219

ekg / seqwish

Raised an error when two sequences with the same name are input (FASTA or FASTQ) #58