Rfam / rfam-production

Rfam production pipeline
Apache License 2.0

Remove duplicates from fasta files #40

Open kalvari opened 6 years ago

kalvari commented 6 years ago

The FASTA files contain duplicates because they were generated before the full_region table was de-duplicated. We need to remove the duplicates and find a way to prevent them in the future.
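As a stopgap for the existing files, a minimal sketch of a de-duplication pass might look like the following. It assumes that the first whitespace-delimited token of each header uniquely identifies a record and that the first occurrence should be kept; `parse_fasta` and `dedupe_fasta` are hypothetical helper names, not part of rfam-production.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_fasta(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Keep only the first record seen for each identifier."""
    seen = set()
    for header, seq in records:
        # Assumption: the first token of the header identifies the record.
        key = header.split()[0]
        if key in seen:
            continue
        seen.add(key)
        yield header, seq
```

Passing an open file handle to `parse_fasta` and writing out whatever `dedupe_fasta` yields would rewrite a file without repeated entries. It does not address the root cause, which is the export step.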

kalvari commented 6 years ago

This is correct! Perhaps we should only extract full hits, which guarantees that all the sequences belong to a genome.

kalvari commented 6 years ago

Competing seed sequences against full hits, and always keeping the seed in such cases, should solve the issue.
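A rough sketch of that competition step, under stated assumptions: a region is described by a sequence accession plus start and end coordinates, any coordinate overlap on the same accession counts as a clash, and the seed always wins. The `Region` type and the `overlaps` and `compete` functions are hypothetical names for illustration; a real implementation may want an overlap threshold instead.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    accession: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions lie on the same sequence and share any position."""
    if a.accession != b.accession:
        return False
    # Normalise coordinates so reverse-strand regions (start > end) also work.
    a_lo, a_hi = sorted((a.start, a.end))
    b_lo, b_hi = sorted((b.start, b.end))
    return a_lo <= b_hi and b_lo <= a_hi


def compete(seed: List[Region], full: List[Region]) -> List[Region]:
    """Return all seed regions plus the full hits that clash with no seed region."""
    kept = list(seed)
    for hit in full:
        if not any(overlaps(hit, s) for s in seed):
            kept.append(hit)
    return kept
```

The pairwise comparison is quadratic in the number of regions, which is fine for a single family but would need grouping by accession for larger inputs.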

kalvari commented 6 years ago

I think it makes sense to export only full_region hits.