Unexpected pooling on MS process in JSON output using isatab2json

djcomlab commented 7 years ago

When converting the ISA-Tab of MTBLS2 (http://www.ebi.ac.uk/metabolights/MTBLS2), I noticed that the JSON produced only has one Mass Spectrometry process in the assay process sequence (Mass_Spectrometry1) with no name, and it lists all of the assay data files in the outputs from this single process.

In this case there is supposed to be one MS process per sample (as the assay file shows MS processes uniquely named in the MS Assay Name column, on each line), with one data file each as output.

djcomlab commented 7 years ago

Looking at MTBLS2 in the a_mtbl2_metabolite profiling_mass spectrometry 3.txt file, I can see that there are no values given in the Labeled Extract Name and Label columns on the Chromatography node.

It looks like this is causing unexpected pooling as the generated JSON from isatab2json only produces one Chromatographyprocess #process/Chromatography1.

djcomlab commented 7 years ago

After discussion with @proccaserra, it was determined that the assumption should be that if there is no Labeled Extract Name provided, the row should be parsed 1-1 directly across with no pooling. Pooling should occur only if there are Labeled Extract Name values provided that combine by having the same value in distinct rows.

This should apply to other node types that use Name column disambiguations.

djcomlab commented 7 years ago

Try something like adding an artificially generated column Parameter Value[run order] with unique values to Chromatography, so the parser can disambiguate Chromatography events.

djcomlab commented 7 years ago

After further discussions with the @agbeltran @proccaserra, we note that the algorithm that builds the process sequence works to detect each process event by looking for differences in Parameter Values as well as extra columns such as ___ Name columns and so actually in this MTBLS2 example the error should not in the parser, but rather in how the data is coded.

We should try and detect early pooling in the graphs with the ISA-Tab validator to warn of these cases (new issue opened for validator to detect this in #160).

However after modifying MTBLS2's assay table file by adding a Parameter Value[curation run order] with unique values (a count) to disambiguate the process events, the conversion still creates the same output with a single Chromatography process in the process sequence.

djcomlab commented 7 years ago

Further testing confirms that the way to encode the ISA-Tab so it does not split on Chromatography is to put unique values in the Labeled Extract Name column.

I tried both adding a Parameter Value[curation run order] column and also using the Label with unique values and in both these cases the pooling still occurs in the JSON on a single process #process/Chromatography1.

djcomlab commented 7 years ago

Currently, we plan to implement feature in #160 to see how many occurrences of the pooling problem we can detect in MetaboLights ISA-Tabs to estimate how much work it would take to update ISA-Tabs on their side.

djcomlab commented 7 years ago

Pooling test run on MetaboLights reveals many examples of possible pooling, the most common such as:

Possible process pooling detected on:  #process/Chromatography1
Possible process pooling detected on:  #process/Extraction1
Possible process pooling detected on:  #process/Data_transformation1
Possible process pooling detected on:  #process/Metabolite_identification1

On the data transformation and metabolite identification, I believe the pooling is expected (as it converges on a MAF file.

Where Chromatography and Extraction processes seem to pool seems to indicate that there's no values in the __ Name column and the rest of the parameters are all the same, and so the ISA-Tab parser interprets this as one process. In our tests, we ran this test on MTBLS1-MTBLS156 and revealed these kinds of pooling in 115 (Extraction) and 33 (Chromatography) instances.

proccaserra commented 7 years ago

yes, pooling expected for data transformation.

so the parser does not understand: i. 'optional' Labeled Extract Name if the header is present but the column is empty (when using a configuration which indicate that such material node may be present. What happens if a config has not such node? ii. protocol chains where input and output are all distincts (n input, n output). In this situation, i would have thought a 'run order' or 'date' field would be able to indicate distinct events.

did the tests you ran also tested different configuration layouts?

djcomlab commented 7 years ago

The test does not consider configurations. All it does is converts the ISA-Tab in MetaboLights to ISA JSON using the isatab2json converter (which uses the ISA-Tab parser), then we load the JSON into the model that in-turn calculates the real graph to report the pooling.

We already talked about using some kind of Parameter Value to add something like a run order or process ID (like a meaningless hash of some sort), although we noticed that PVs are not currently used in to determine uniqueness in the parser (Alex I think is doing a fix for this), but I think qualifiers (data and performer) and in/outputs are.

If the optional Name column is not present, it still pools (I tried on MTBLS1).

ISA-tools / isa-api

Unexpected pooling on MS process in JSON output using isatab2json #146