Closed colin-combe closed 10 months ago
For csv-input, if a fasta-file is provided xiFDR will write out the sequence for the target-proteins. For decoys it just writes out the size of the protein (it does not have the sequence for now).
If you agree that mzIdentML doesn't allow you to make this connection to FASTA files then I will close this.
I'm not sure... I would think it would, but there's nothing in the spec that says "this is the identifier from the sequence database"
I think the best thing here is to update the documentation on the upload page to say FASTA files are only used in conjunction with CSV files
On the positive side that means no FASTA-file needs to be uploaded when using mzIdentML. But I think it might still be worthwhile looking into matching the data somehow from the mzid to the fasta - as not all mzIdentML files will have a sequence.
On the positive side that means no FASTA-file needs to be uploaded when using mzIdentML.
Yes, the upload page should be updated to provide feedback about this; in the short term, at least document it
worthwhile looking into matching the data somehow
we should be looking for ways to make things simpler not more complicated - so I think just put sequence info in mzIdentML
we should be looking for ways to make things simpler not more complicated
yes I agree with the sentiment - but from the other side of the ocean. Probably not all software will export the sequence. So for the user it is simpler if the software can look up the accession in the FASTA-file.
For xiFDR we can do that - and it will export the sequence. For other sources of mzIdentML we have no control over how it is exported.
Two options I see: given an accession you could try to retrieve the sequence from uniprot. Or, given a FASTA file, split the FASTA header at anything not [A-Za-z0-9] and see if anything matches the accession. Actually what I use in xiFDR is something like
String[] parts = fastaheader.substring(1).split("[\\|\\s\\t\\.]");
Then build up a dictionary from the parts to the proteins that these were found in. If you want it clean you could keep track of ambiguous substrings and remove them from the lookup.
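A minimal sketch of the lookup described above, assuming plain Java collections. The class and method names (`FastaHeaderIndex`, `addHeader`, `find`) are made up for illustration and are not taken from xiFDR; the split pattern is the one quoted above, minus the redundant `\\t` (already covered by `\\s`). Tokens that occur in more than one header are treated as ambiguous and dropped from the lookup.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not xiFDR code: maps header tokens to the header they came from.
final class FastaHeaderIndex {
    // token -> FASTA header (without the leading '>') in which it was found
    private final Map<String, String> lookup = new HashMap<>();
    // tokens seen in more than one header; these never resolve
    private final Set<String> ambiguous = new HashSet<>();

    void addHeader(String fastaHeader) {
        String header = fastaHeader.startsWith(">") ? fastaHeader.substring(1) : fastaHeader;
        for (String part : header.split("[\\|\\s\\.]")) {
            if (part.isEmpty() || ambiguous.contains(part)) {
                continue;
            }
            String previous = lookup.putIfAbsent(part, header);
            if (previous != null && !previous.equals(header)) {
                // same token in two different headers: remove it from the lookup
                lookup.remove(part);
                ambiguous.add(part);
            }
        }
    }

    // Returns the header for an accession, or empty if unknown or ambiguous.
    Optional<String> find(String accession) {
        return Optional.ofNullable(lookup.get(accession));
    }
}
```

With two UniProt-style headers loaded, an accession such as P02768 would resolve to its own header, while shared tokens such as the `sp` prefix would not resolve at all.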
you could try to retrieve the sequence from uniprot.
assuming things are currently working correctly (you may tell me they aren't) then this is what it should already do
if the Seq element and the length attribute are missing then it looks up the canonical sequence from uniprot using the accession number which is required
if length is present but Seq is missing you get a list of 'X's that length
we're really only saying you must provide the sequence in mzIdentML if you have a non-canonical one, otherwise accession is sufficient
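The intended fallback order can be sketched roughly as follows. This is an illustration only, not the actual reader code: `SequenceResolver` and `resolve` are hypothetical names, and the UniProt lookup is passed in as a function so the sketch makes no assumption about how the canonical sequence is actually fetched. `String.repeat` needs Java 11 or later.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the fallback order described above.
final class SequenceResolver {
    // Assumed stand-in for "look up the canonical sequence by accession".
    private final Function<String, Optional<String>> canonicalLookup;

    SequenceResolver(Function<String, Optional<String>> canonicalLookup) {
        this.canonicalLookup = canonicalLookup;
    }

    // seq and length mirror the optional Seq element and length attribute of DBSequence.
    Optional<String> resolve(String accession, String seq, Integer length) {
        if (seq != null && !seq.isEmpty()) {
            return Optional.of(seq);                // 1. explicit sequence wins
        }
        if (length != null && length > 0) {
            return Optional.of("X".repeat(length)); // 2. placeholder of 'X's of that length
        }
        return canonicalLookup.apply(accession);    // 3. fall back to the accession
    }
}
```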
i'll check this is working... let me know if you find it isn't
At least when I tested it while reporting issue #31 - the accession for the decoy was the proper uniprot accession for HSA (P02768) and the protein was displayed as a dot with no sequence attached to it. Also when I did not export a sequence for the target - both were shown as a dot/protein of length 1. So as of yesterday that did not work for me
you're right, it's not working
it should be fixed now, and use accession to get sequence if both Seq and length are missing
closed, it's a requirement for this version that mzIdentML files contain sequences.
Though also I was wrong to say above that mzIdentML doesn't let you reference an associated FASTA file. It does and actually this should be fixed (primarily an issue with the mzIdentML reader)
@lutzfischer - actually, as far as I can tell, the mzIdentML schema doesn't explicitly allow you to do this. I.e. there's no attribute of DBSequence that can definitely be used to find the sequence in the associated SequenceDatabase? (p.39 of 1.2 spec)
If the (optional) Seq element is missing from DBSequence we can look up its canonical sequence using its (required) Accession attribute.
Accession could also probably be used to look up sequence in FASTA file, but there's no attribute that specifically gives the identifier used in the FASTA file / sequence database.
I think there's less of an issue here than I thought