Rappsilber-Laboratory / build-xiview

GNU General Public License v3.0

fasta files getting ignored when uploaded with mzIdentML #13

Closed colin-combe closed 10 months ago

colin-combe commented 5 years ago

@lutzfischer - actually, as far as I can tell, the mzIdentML schema doesn't explicitly allow you to do this. I.e. there's no attribute of DBSequence that can definitely be used to find the sequence in the associated SequenceDatabase? (p.39 of 1.2 spec)

If the (optional) Seq element is missing from DBSequence we can look up its canonical sequence using its (required) Accession attribute.

Accession could also probably be used to look up sequence in FASTA file, but there's no attribute that specifically gives the identifier used in the FASTA file / sequence database.

I think there's less of an issue here than I thought

lutzfischer commented 5 years ago

For csv-input, if a fasta-file is provided xiFDR will write out the sequence for the target-proteins. For decoys it just writes out the size of the protein (it does not have the sequence for now).

colin-combe commented 5 years ago

If you agree that mzIdentML doesn't allow you to make this connection to FASTA files then I will close this.

I'm not sure... I would think it would, but there's nothing in the spec that says "this is the identifier from the sequence database"

colin-combe commented 5 years ago

I think the best thing here is to update the documentation on the upload page to say FASTA files are only used in conjunction with CSV files

lutzfischer commented 5 years ago

On the positive side, that means no FASTA file needs to be uploaded when using mzIdentML. But I think it might still be worthwhile looking into matching the data somehow from the mzid to the FASTA, as not all will have a sequence.

colin-combe commented 5 years ago

On the positive side, that means no FASTA file needs to be uploaded when using mzIdentML.

Yes, the upload page should be updated to provide feedback about this; in the short term, at least document it

worthwhile looking into matching the data somehow

we should be looking for ways to make things simpler not more complicated - so I think just put sequence info in mzIdentML

lutzfischer commented 5 years ago

we should be looking for ways to make things simpler not more complicated

yes I agree with the sentiment - but from the other side of the ocean. Not all software will probably export the sequence. So for the user it is simpler if the software can look up the accession in the FASTA-file.

For xiFDR we can do that, and it will export the sequence. For other sources of mzIdentML we have no control over how it is exported.

I see two options. Given an accession, you could try to retrieve the sequence from UniProt. Or, given a FASTA file, split the FASTA header at anything not [A-Za-z0-9] and see if anything matches the accession. Actually, what I use in xiFDR is something like

String[] parts = fastaheader.substring(1).split("[\\|\\s\\t\\.]");

Then build up a dictionary from the parts to the proteins that they were found in. If you want it clean, you could keep track of ambiguous substrings and remove them from the lookup.
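
A minimal sketch of that lookup, for illustration only and not the actual xiFDR code. It reuses the split pattern quoted above; the class name `FastaAccessionIndex` and the method `build` are hypothetical, and it assumes every header string starts with `>`. Tokens that turn up in more than one distinct header are treated as ambiguous and dropped from the lookup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: maps each unambiguous header token to its FASTA header.
final class FastaAccessionIndex {

    static Map<String, String> build(Iterable<String> fastaHeaders) {
        Map<String, String> tokenToHeader = new HashMap<>();
        Set<String> ambiguous = new HashSet<>();
        for (String header : fastaHeaders) {
            // Assumes the header starts with '>'; split as in the snippet above.
            String[] parts = header.substring(1).split("[\\|\\s\\t\\.]");
            // De-duplicate so a token repeated within one header is not flagged.
            for (String part : new HashSet<>(Arrays.asList(parts))) {
                if (part.isEmpty() || ambiguous.contains(part)) {
                    continue;
                }
                String previous = tokenToHeader.putIfAbsent(part, header);
                if (previous != null && !previous.equals(header)) {
                    // Seen in two different headers: remove it from the lookup.
                    tokenToHeader.remove(part);
                    ambiguous.add(part);
                }
            }
        }
        return tokenToHeader;
    }

    public static void main(String[] args) {
        Map<String, String> index = build(List.of(
                ">sp|P02768|ALBU_HUMAN Serum albumin",
                ">sp|P02769|ALBU_BOVIN Serum albumin"));
        System.out.println(index.get("P02768"));
        System.out.println(index.containsKey("sp"));
    }
}
```

With the two example headers, `P02768` resolves to the first header, while `sp`, `Serum` and `albumin` occur in both and are therefore left out. An accession from a DBSequence element could then be looked up directly in the returned map.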

colin-combe commented 5 years ago

you could try to retrieve the sequence from uniprot.

assuming things are currently working correctly (you may tell me they aren't) then this is what it should already do

if the Seq element and the length attribute are missing then it looks up the canonical sequence from uniprot using the accession number which is required

if length is present but Seq is missing you get a list of 'X's that length

we're really only saying you must provide the sequence in mzIdentML if you have a non-canonical one, otherwise accession is sufficient

I'll check this is working... let me know if you find it isn't
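
The intended fallback order can be sketched roughly as follows. This is an illustration only, not the actual xiVIEW parser code: the class `SequenceResolver`, the method `resolve` and the `lookupByAccession` function are all hypothetical names, and the UniProt lookup is passed in as a plain function so that no particular web API is assumed.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the fallback: Seq, then length, then accession lookup.
final class SequenceResolver {

    static Optional<String> resolve(String seq,
                                    Integer length,
                                    String accession,
                                    Function<String, Optional<String>> lookupByAccession) {
        // 1. An explicit Seq element always wins.
        if (seq != null && !seq.isEmpty()) {
            return Optional.of(seq);
        }
        // 2. Only the length is known: use a placeholder of 'X's of that length.
        if (length != null && length > 0) {
            return Optional.of("X".repeat(length));
        }
        // 3. Neither is present: fall back to the (required) accession.
        return lookupByAccession.apply(accession);
    }

    public static void main(String[] args) {
        // Stand-in for a real lookup; the sequence here is made up for the demo.
        Function<String, Optional<String>> stub =
                acc -> "P02768".equals(acc) ? Optional.of("MKWVTF") : Optional.empty();
        System.out.println(resolve(null, 4, "P02768", stub));
        System.out.println(resolve(null, null, "P02768", stub));
    }
}
```

If the lookup finds nothing for the accession, the result is empty and the caller has to decide how to display the protein.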

lutzfischer commented 5 years ago

At least when I tested it while reporting issue #31, the accession for the decoy was the proper UniProt accession for HSA (P02768) and the protein was displayed as a dot with no sequence attached to it. Also, when I did not export a sequence for the target, both were shown as a dot/protein of length 1. So as of yesterday that did not work for me.

colin-combe commented 5 years ago

you're right, it's not working

colin-combe commented 5 years ago

it should be fixed now, and use accession to get sequence if both Seq and length are missing

colin-combe commented 10 months ago

closed, it's a requirement for this version that mzIdentML files contain sequences.

Though I was also wrong to say above that mzIdentML doesn't let you reference an associated FASTA file. It does, and actually this should be fixed (primarily an issue with the mzIdentML reader).