SynBioDex / libSBOLj

Java Library for Synthetic Biology Open Language (SBOL)
Apache License 2.0

FASTA parser should generate unique ids #591

Open cjmyers opened 5 years ago

cjmyers commented 5 years ago

A FASTA file like the one below will generate three sequences with the same id (_1T38). It should instead generate unique ids.

>1T38:A|PDBID|CHAIN|SEQUENCE
MRGSHHHHHHGSMDKDCEMKRTTLDSPLGKLELSGCEQGLHEIKLLGKGTSAADAVEVPAPAAVLGGPEPLMQCTAWLNA
YFHQPEAIEEFPVPALHHPVFQQESFTRQVLWKLLKVVKFGEVISYQQLAALAGNPKAARAVGGAMRGNPVPILIPSHRV
VCSSGAVGNYSGGLAVKEWLLAHEGHRL
>1T38:B|PDBID|CHAIN|SEQUENCE
GCCATGGCTAGTA
>1T38:C|PDBID|CHAIN|SEQUENCE
TACTAGCCATGGC

jakebeal commented 5 years ago

Shouldn't the IDs be 1T38:A, 1T38:B, and 1T38:C?

cjmyers commented 5 years ago

Currently, the header line is split at ":", with the left side becoming the displayId and the right side becoming the description. I swear I saw this convention mentioned somewhere, but I now cannot find it. I did find this interesting blog post about this issue:

http://www.acgt.me/blog/2013/6/25/the-fasta-file-format-a-showcase-for-the-best-and-worst-of-b.html

Apparently, the whole line must be unique, but any subset of it is not guaranteed to be unique. Probably the only solution is to take the header line and hash it to try to come up with some sort of unique id.
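
To make the hashing idea concrete, here is a minimal Java sketch. It is not libSBOLj code: the class and method names (`HeaderHashSketch`, `idFromColonSplit`, `headerHash`) are hypothetical, the split method only imitates the behaviour described above, and SHA-256 is just one possible choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; none of these names exist in libSBOLj.
public class HeaderHashSketch {

    // Imitates the behaviour described in the thread:
    // the text left of ':' becomes the id.
    static String idFromColonSplit(String header) {
        int colon = header.indexOf(':');
        return colon < 0 ? header : header.substring(0, colon);
    }

    // Hex-encoded SHA-256 digest of the complete header line.
    // SHA-256 is an assumption; any stable hash would serve.
    static String headerHash(String header) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(header.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] headers = {
            "1T38:A|PDBID|CHAIN|SEQUENCE",
            "1T38:B|PDBID|CHAIN|SEQUENCE",
            "1T38:C|PDBID|CHAIN|SEQUENCE"
        };
        for (String h : headers) {
            // The split id is identical for all three headers,
            // while the hash of the whole line differs.
            System.out.println(idFromColonSplit(h) + "  " + headerHash(h));
        }
    }
}
```

All three headers collapse to the same split id, while the hash of the full line distinguishes them, at the cost of an id that no longer resembles the original accession.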

jakebeal commented 5 years ago

I think that making a hash is a good way to check if you're dealing with duplicates or not.

I would strongly suggest staying away from using hashes in names except when forced, however, as that will tend to break the relationship with UIDs used elsewhere. Searching for 1T38, for example, brings up what you'd want to find for these sequences in a whole bunch of databases, so it can't just be delegated to provenance.

My suggestion:

  1. Record the most common formats (e.g., NCBI) and try parsing with them to see if we can be assured of having the right ID.
  2. Check for duplicates using hashing.
  3. If we don't understand the header and there is a conflict, then as a fallback, differentiate the IDs with a suffix of the form ID-hash[6 char hash].
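
The three steps above could be sketched roughly as follows. This is an illustration under several assumptions, not a proposal for the actual parser: the class `FastaIdSketch` and its methods are hypothetical, the only "known format" is a regular expression derived from the PDB-style example in this issue, the suffix is read as "ID, a dash, then six hex characters", and any sanitising needed to make a valid SBOL displayId is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the three-step suggestion; not libSBOLj code.
public class FastaIdSketch {

    // Step 1: a single recognised format, inferred from the example header
    // "1T38:A|PDBID|CHAIN|SEQUENCE". A real list would cover NCBI etc.
    private static final Pattern PDB_STYLE =
        Pattern.compile("([0-9A-Za-z]{4}):([0-9A-Za-z]+)\\|PDBID\\|CHAIN\\|SEQUENCE");

    // Returns "entry:chain" for a recognised header, or null otherwise.
    static String idFromKnownFormat(String header) {
        Matcher m = PDB_STYLE.matcher(header);
        return m.matches() ? m.group(1) + ":" + m.group(2) : null;
    }

    // First six hex characters of the SHA-256 digest of the whole header.
    static String shortHash(String header) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                .digest(header.getBytes(StandardCharsets.UTF_8));
            return String.format("%02x%02x%02x", d[0], d[1], d[2]);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static List<String> assignIds(List<String> headers) {
        Set<String> used = new HashSet<>();   // Step 2: hash-based duplicate check
        List<String> ids = new ArrayList<>();
        for (String header : headers) {
            String id = idFromKnownFormat(header);
            if (id == null) {
                // Unknown format: fall back to the text before ':'.
                int colon = header.indexOf(':');
                id = colon < 0 ? header : header.substring(0, colon);
                // Step 3: only on a conflict, append the short hash.
                if (used.contains(id)) {
                    id = id + "-" + shortHash(header);
                }
            }
            used.add(id);
            ids.add(id);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(assignIds(Arrays.asList(
            "1T38:A|PDBID|CHAIN|SEQUENCE",
            "1T38:B|PDBID|CHAIN|SEQUENCE",
            "foo:first record",
            "foo:second record")));
    }
}
```

In this sketch the first record with an unrecognised header keeps its plain id and only later conflicting records get the hash suffix. Conflicts between recognised ids, or between records with identical header lines, are not handled.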