MassBank / RMassBank

Playground for experiments on the official http://bioconductor.org/packages/devel/bioc/html/RMassBank.html

Include scan number in output spectra/MassBank records #218

Open schymane opened 5 years ago

schymane commented 5 years ago

See discussion https://github.com/MassBank/MassBank-data/issues/79

Including the scan number would also let users extract the raw data. This also requires that we cross-link our records to the raw data (same discussion).

schymane commented 5 years ago

We (@tsufz and I) would propose to have a new field to do this: MS$DATA_PROCESSING: SCAN_NUMBER 1234 and we will post an issue in MassBank-web @meier-rene @Treutler

@MaliRemorker @adelenelai the scan number is already used/available in the RMassBank workflow, do you want to investigate further? It will help debugging if it's in the records, not just in the failpeaks list.

meowcat commented 5 years ago

Hmm... On what time scale is this important? I'm asking because in the best of cases, I would like to substitute the actual record generation step by simply casting into a template using github.com/MSnio. It can easily be hacked in already, of course.

schymane commented 5 years ago

I would prefer a quick hack for now so that we can start playing around with linking raw data

meier-rene commented 5 years ago

I would appreciate it if you could create a minimal example for me.

schymane commented 5 years ago

Do you need a minimal example for RMassBank? Or MassBank-data?

meier-rene commented 5 years ago

Just an example record file.

schymane commented 5 years ago

Basically, the scan number is already stored internally in RMassBank, we just need to print it out - but haven't yet actually done this. There also isn't the accepted tag yet - do you agree with MS$DATA_PROCESSING: SCAN_NUMBER 1234 ?

meier-rene commented 5 years ago

For me this is fine. I could include this into the record spec file. But if I understand it correctly this is related to the raw data file linking. Is it directly connected or does it make sense without the raw data linking?

schymane commented 5 years ago

It makes less sense without the raw data linking, but would still be useful. For instance, we are now doing a prescreening workflow internally, and if we can print the scan number into the record, it helps us debug our results in the vendor file. Eventually, our plan is that when we upload the records, we will also deposit the raw data with them and add the respective fields as well. For instance, here is our "fail peak" list with scan numbers: [image]

and this would mean that e.g. https://massbank.eu/MassBank/RecordDisplay.jsp?id=EA281702 would have MS$DATA_PROCESSING: SCAN_NUMBER 558
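A minimal sketch of what "just printing it out" could look like on the record text, written in Python for illustration rather than in RMassBank's own R code. The function name `add_scan_number` is hypothetical, the tag was still only a proposal at this point in the thread, and the sketch assumes the MS$ lines come before the PK$ block in a record.

```python
def add_scan_number(record_text: str, scan_number: int) -> str:
    """Insert the proposed scan-number line before the first PK$ line.

    Hypothetical helper: the tag name follows the proposal in this thread
    and is not an accepted part of the record format.
    """
    new_line = f"MS$DATA_PROCESSING: SCAN_NUMBER {scan_number}"
    lines = record_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("PK$"):
            # Keep the new line with the other MS$ lines, ahead of the peaks.
            return "\n".join(lines[:i] + [new_line] + lines[i:]) + "\n"
    # No peak block found: fall back to appending at the end.
    return "\n".join(lines + [new_line]) + "\n"


# Toy fragment for illustration only; not the content of any real record.
toy_record = (
    "ACCESSION: XX000001\n"
    "MS$FOCUSED_ION: PRECURSOR_TYPE [M+H]+\n"
    "PK$NUM_PEAK: 1\n"
)
updated = add_scan_number(toy_record, 558)
print(updated)
```

The same function would work unchanged if a different tag were chosen later; only the `new_line` template would need to change.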

meier-rene commented 5 years ago

Ok, then I will implement these two items independently, which makes it easier.

schymane commented 5 years ago

Makes sense. Then people will be able to link raw data without specifying the scan, and specify the scan without linking raw data. I expect both cases will happen. Even if the ideal is that they are coupled, we do not want to rule one out because the other is missing.

meowcat commented 5 years ago

Is DATA_PROCESSING the right tag for this? Shouldn't DATA_PROCESSING say which steps were taken to process the data?

Retention time, which is analogous to scan number, is in CHROMATOGRAPHY. (But I understand that here the important thing is the link to the raw data and not so much the "time" dimension of the scan #.)

By the way, how would you deal with spectra that are derived from multiple scans?

meowcat commented 5 years ago

Maybe we shouldn't be too shy to implement a new MS$ tag for provenance, but I don't know how hard this is on the database side.

schymane commented 5 years ago

@tsufz and I iterated through a few options and ended up at DATA_PROCESSING as the best but not perfect solution. A new MS$ tag would be an idea too. So something like this?

MS$RAW: SCAN_NUMBER 1234

or MS$SCAN: 1234 1235 1236 1237 (space separated in the case of multiple? other suggestions?)

[the rest for the record were] MS$RAW: DOI ... MS$RAW: GNPS .... MS$RAW: METABOLIGHTS ... MS$RAW: METABOLOMICSWB ... MS$RAW: ZENODO ...

sneumann commented 5 years ago

Hi, I would like to keep this not too far from the mzML specification http://www.psidev.info/mzML, where spectrum references have been discussed in depth. There are two flavours in http://www.peptideatlas.org/tmp/mzML1.1.0.html#spectrum:

1) The index, which is "The zero-based, consecutive index of the spectrum in the SpectrumList."
2) The id, which is "The native identifier for a spectrum." This captures, e.g., the function for Waters, with examples in http://www.peptideatlas.org/tmp/mzML1.1.0.html#sourceFile

And we do need to be able to reference multiple spectra, which could be a comma-separated list, or a dash-separated range.
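As a rough illustration of the two notations mentioned here, the following Python sketch expands a value that mixes comma-separated scans and dash-separated ranges into a plain list. The function name `parse_scan_refs` and the exact syntax it accepts are assumptions for this example, not part of any agreed record format.

```python
from typing import List


def parse_scan_refs(value: str) -> List[int]:
    """Expand e.g. "1234, 1236, 1240-1242" into a list of scan numbers.

    Hypothetical syntax: commas separate entries, a dash marks an
    inclusive range.
    """
    scans: List[int] = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            scans.extend(range(start, end + 1))  # inclusive on both ends
        else:
            scans.append(int(part))
    return scans


print(parse_scan_refs("1234-1237"))              # [1234, 1235, 1236, 1237]
print(parse_scan_refs("1234, 1236, 1240-1242"))  # [1234, 1236, 1240, 1241, 1242]
```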

We also need the raw data filename and/or direct download URL, to which we index. This is in addition to the DOI or MTBLS accession number. DOI might only refer to a ZIP file.

Yours, Steffen

tsufz commented 5 years ago

@sneumann, This is quite different in the Thermo export: <spectrum index="0" id="controllerType=0 controllerNumber=1 scan=1" defaultArrayLength="234" dataProcessingRef="pwiz_Reader_Thermo_conversion">

In the original data, there is no index, just a scan number. I suggest going with: MS$RAW: SCAN 1234
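To make the difference between the two identifiers concrete, here is a small Python sketch that reads both the index and the scan number from the spectrum element quoted above. The element is closed as a self-contained tag so that it parses on its own; a real mzML file also carries an XML namespace, which this sketch does not handle. The helper name `scan_from_native_id` is made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# The spectrum tag from the Thermo export above, self-closed for parsing.
snippet = (
    '<spectrum index="0" '
    'id="controllerType=0 controllerNumber=1 scan=1" '
    'defaultArrayLength="234" '
    'dataProcessingRef="pwiz_Reader_Thermo_conversion"/>'
)


def scan_from_native_id(native_id: str):
    """Return the scan number from a Thermo-style native id, or None."""
    match = re.search(r"(?:^|\s)scan=(\d+)(?:\s|$)", native_id)
    return int(match.group(1)) if match else None


element = ET.fromstring(snippet)
index = int(element.get("index"))            # zero-based position in the list
scan = scan_from_native_id(element.get("id"))  # vendor scan number
print(index, scan)  # 0 1
```

Here the index and the scan number differ by one, which is why it matters which of the two a record field refers to.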

I also suggest separating multiple scans with a space, as a vector, which is standard in the MassBank record format. Alternatively, it could be a list, like the peak list.

tsufz commented 5 years ago

MS$RAW: SCAN 1234 was a mistake, but I like the idea to reduce the number of main tags and to work with subtags to avoid main tag inflation...

tsufz commented 2 years ago

I talked with @meier-rene about the topic. We should pick this up again to enhance provenance of the records, also in relation to NFDI4Chem. Thus,

tsufz commented 2 years ago

In this context, we also spoke about better ways to link to external repositories (as mentioned by @schymane in https://github.com/MassBank/RMassBank/issues/218#issuecomment-507709057). This is also required for NFDI4Chem, for example to link to a Zenodo or RADAR4Chem repository entry containing the raw data files.

tsufz commented 2 years ago

We also want to include the ROR, for example UFZ

sneumann commented 2 years ago

I would love to see MS$RAW: USI ... (see https://www.psidev.info/usi), e.g. mzspec:MTBLS2:MSpos-Ex1-Col0-48h-Ag-2_1-A,1_01_9820.mzML:scan:11850
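For orientation, a hedged Python sketch of how such a USI breaks down into its parts. It assumes the layout mzspec:collection:run:index type:index, with an optional trailing interpretation, and it assumes the run name contains no colons; the PSI specification is the authority on the details. `Usi` and `parse_usi` are names invented for this example.

```python
from typing import NamedTuple, Optional


class Usi(NamedTuple):
    collection: str
    ms_run: str
    index_type: str
    index: str  # kept as text, since not every index type is a plain number
    interpretation: Optional[str]


def parse_usi(usi: str) -> Usi:
    """Split a USI into its colon-separated parts (simplified sketch)."""
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a USI: {usi!r}")
    interpretation = parts[5] if len(parts) == 6 else None
    return Usi(parts[1], parts[2], parts[3], parts[4], interpretation)


parsed = parse_usi(
    "mzspec:MTBLS2:MSpos-Ex1-Col0-48h-Ag-2_1-A,1_01_9820.mzML:scan:11850"
)
print(parsed.collection, parsed.index_type, parsed.index)  # MTBLS2 scan 11850
```

One identifier of this kind carries the repository accession, the file name and the scan together, which covers most of what the separate fields discussed earlier were meant to hold.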

meowcat commented 2 years ago

Do we need any MS$RAW that is not USI? I guess yes, for referencing "local" files that are not in a repo? How do we make that meaningful - just a filename is not saying anything about provenance, should we use an md5 or sha-something of the file? Either way, I believe this issue is currently not an RMassBank issue but a record format issue.
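If a checksum were chosen for local files, computing it is straightforward. The Python sketch below hashes a file with SHA-256 in chunks so that large raw files do not need to fit in memory. How the digest would be written into a record is left open here, since no tag for it has been proposed; `file_sha256` is a hypothetical helper name, and the temporary file only stands in for a raw data file.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; the bytes are placeholders, not real raw data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"placeholder bytes standing in for a raw file")
    demo_path = tmp.name

demo_digest = file_sha256(demo_path)
os.remove(demo_path)
print(demo_digest)
```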