Closed djcomlab closed 7 years ago
Looking at MTBLS2 in the a_mtbl2_metabolite profiling_mass spectrometry 3.txt
file, I can see that there are no values given in the Labeled Extract Name
and Label
columns on the Chromatography
node.
It looks like this is causing unexpected pooling as the generated JSON from isatab2json
only produces one Chromatography
process #process/Chromatography1
.
After discussion with @proccaserra, it was determined that the assumption should be that if there is no Labeled Extract Name
provided, the row should be parsed 1-1 directly across with no pooling. Pooling should occur only if there are Labeled Extract Name
values provided that combine by having the same value in distinct rows.
This should apply to other node types that use Name
column disambiguations.
Try something like adding an artificially generated column Parameter Value[run order]
with unique values to Chromatography
, so the parser can disambiguate Chromatography events.
After further discussions with the @agbeltran @proccaserra, we note that the algorithm that builds the process sequence works to detect each process event by looking for differences in Parameter Value
s as well as extra columns such as ___ Name
columns and so actually in this MTBLS2 example the error should not in the parser, but rather in how the data is coded.
We should try and detect early pooling in the graphs with the ISA-Tab validator to warn of these cases (new issue opened for validator to detect this in #160).
However after modifying MTBLS2's assay table file by adding a Parameter Value[curation run order]
with unique values (a count) to disambiguate the process events, the conversion still creates the same output with a single Chromatography process in the process sequence.
Further testing confirms that the way to encode the ISA-Tab so it does not split on Chromatography
is to put unique values in the Labeled Extract Name
column.
I tried both adding a Parameter Value[curation run order]
column and also using the Label
with unique values and in both these cases the pooling still occurs in the JSON on a single process #process/Chromatography1
.
Currently, we plan to implement feature in #160 to see how many occurrences of the pooling problem we can detect in MetaboLights ISA-Tabs to estimate how much work it would take to update ISA-Tabs on their side.
Pooling test run on MetaboLights reveals many examples of possible pooling, the most common such as:
Possible process pooling detected on: #process/Chromatography1
Possible process pooling detected on: #process/Extraction1
Possible process pooling detected on: #process/Data_transformation1
Possible process pooling detected on: #process/Metabolite_identification1
On the data transformation and metabolite identification, I believe the pooling is expected (as it converges on a MAF file.
Where Chromatography and Extraction processes seem to pool seems to indicate that there's no values in the __ Name
column and the rest of the parameters are all the same, and so the ISA-Tab parser interprets this as one process. In our tests, we ran this test on MTBLS1-MTBLS156 and revealed these kinds of pooling in 115 (Extraction) and 33 (Chromatography) instances.
yes, pooling expected for data transformation.
so the parser does not understand: i. 'optional' Labeled Extract Name if the header is present but the column is empty (when using a configuration which indicate that such material node may be present. What happens if a config has not such node? ii. protocol chains where input and output are all distincts (n input, n output). In this situation, i would have thought a 'run order' or 'date' field would be able to indicate distinct events.
did the tests you ran also tested different configuration layouts?
The test does not consider configurations. All it does is converts the ISA-Tab in MetaboLights to ISA JSON using the isatab2json
converter (which uses the ISA-Tab parser), then we load the JSON into the model that in-turn calculates the real graph to report the pooling.
We already talked about using some kind of Parameter Value
to add something like a run order or process ID (like a meaningless hash of some sort), although we noticed that PVs are not currently used in to determine uniqueness in the parser (Alex I think is doing a fix for this), but I think qualifiers (data and performer) and in/outputs are.
If the optional Name column is not present, it still pools (I tried on MTBLS1).
When converting the ISA-Tab of MTBLS2 (http://www.ebi.ac.uk/metabolights/MTBLS2), I noticed that the JSON produced only has one Mass Spectrometry process in the assay process sequence (
Mass_Spectrometry1
) with noname
, and it lists all of the assay data files in theoutputs
from this single process.In this case there is supposed to be one MS process per sample (as the assay file shows MS processes uniquely named in the MS Assay Name column, on each line), with one data file each as output.