ISA-tools / isa-api

ISA tools API
https://isa-tools.org

Multiple Data Files of The Same Type Will Only Have 1 Name in Assay Conversion #509

Open ptth222 opened 8 months ago

ptth222 commented 8 months ago

If you create two files of the same type in the same assay in a JSON-to-Tab conversion, only the last file's name will appear in both columns. For example, if you have two Raw Data Files, 'data_file1' and 'data_file2', only 'data_file2' will appear in both Raw Data File columns (assuming 'data_file2' is later in the process sequence).
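
A minimal sketch of the kind of overwrite that could produce this symptom (illustrative only, not the actual isatools code): if a writer keys cell values by column label alone, the second file of the same type silently clobbers the first.

```python
# Illustrative sketch, NOT isatools code: keying cells by column label
# alone means a second "Raw Data File" value overwrites the first.
row = {}
for label, value in [("Raw Data File", "data_file1"),
                     ("Raw Data File", "data_file2")]:
    row[label] = value  # second assignment clobbers the first

print(row)  # {'Raw Data File': 'data_file2'} -- only the last name survives
```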

Example to reproduce:

import json

from isatools.convert import json2isatab

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/json/BII-I-1/BII-I-1.json', 'r') as jsonFile:
    isa_example = json.load(jsonFile)

## Delete process sequence for transcriptome and replace it.
del isa_example["studies"][0]["assays"][2]["processSequence"]

protocol1 = {
          "@id": "#protocol/protocol1",
          "name": "protocol1",
        }
protocol2 = {
          "@id": "#protocol/protocol2",
          "name": "protocol2",
        }
protocol3 = {
          "@id": "#protocol/protocol3",
          "name": "protocol3",
        }
isa_example["studies"][0]["protocols"].append(protocol1)
isa_example["studies"][0]["protocols"].append(protocol2)
isa_example["studies"][0]["protocols"].append(protocol3)

data_file1 = {
          "@id": "#data/data_file1",
          "name": "data_file1",
          "type": "Raw Data File"
        }
data_file2 = {
          "@id": "#data/data_file2",
          "name": "data_file2",
          "type": "Raw Data File"
        }
data_file3 = {
          "@id": "#data/data_file3",
          "name": "data_file3",
          "type": "Raw Data File"
        }
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file1)
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file2)
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file3)

data_file4 = {
          "@id": "#data/data_file4",
          "name": "data_file4",
          "type": "Raw Data File"
        }
data_file5 = {
          "@id": "#data/data_file5",
          "name": "data_file5",
          "type": "Raw Data File"
        }
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file4)
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file5)

new_process = [{
          "@id": "#process/protocol1",
          "executesProtocol": {
            "@id": "#protocol/protocol1"
          },
          "inputs": [
              {'@id': '#sample/sample-C-0.07-aliquot1'}
              ],
          "outputs": [
            {
              "@id": "#data/data_file1"
            },
          ],
          "nextProcess": {"@id": "#process/protocol2"}
        },
    {
          "@id": "#process/protocol2",
          "executesProtocol": {
            "@id": "#protocol/protocol2"
          },
          "inputs": [
              {'@id': "#data/data_file1"}
              ],
          "outputs": [
            {
              "@id": "#data/data_file2"
            },
          ],
          "previousProcess": {"@id": "#process/protocol1"},
          "nextProcess": {"@id": "#process/protocol3"}
        },
    {
          "@id": "#process/protocol3",
          "executesProtocol": {
            "@id": "#protocol/protocol3"
          },
          "inputs": [
              {'@id': "#data/data_file2"}
              ],
          "outputs": [
            {
              "@id": "#data/data_file3"
            },
          ],
          "previousProcess": {"@id": "#process/protocol2"},
        },

    {
              "@id": "#process/protocol1_1",
              "executesProtocol": {
                "@id": "#protocol/protocol1"
              },
              "inputs": [
                  {'@id': '#sample/sample-C-0.07-aliquot2'}
                  ],
              "outputs": [
                {
                  "@id": "#data/data_file4"
                },
              ],
              "nextProcess": {"@id": "#process/protocol3_1"}
            },
        {
              "@id": "#process/protocol3_1",
              "executesProtocol": {
                "@id": "#protocol/protocol3"
              },
              "inputs": [
                  {'@id': "#data/data_file4"}
                  ],
              "outputs": [
                {
                  "@id": "#data/data_file5"
                },
              ],
              "previousProcess": {"@id": "#process/protocol1_1"},
            }

    ]
isa_example["studies"][0]["assays"][2]["processSequence"] = new_process

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing.json', 'w') as out_fp:
    json.dump(isa_example, out_fp, indent=2)

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing.json') as file_pointer:
    json2isatab.convert(file_pointer, 'C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing/', validate_first=False)

The above example modifies the "BII-I-1" example: it deletes the transcriptome assay's processSequence and replaces it with a simpler one.

The issue appears to be in the isatools\isatab\dump\write.py file, in the write_assay_table_files function. It is similar to issue #500, where multiple columns of the same data file type are not tracked by name. I adjusted the code so it tracks the names, and the file names now appear as expected. I created a PR, #510.
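
A hypothetical sketch of the fix idea (names are made up, this is not the actual PR code): give repeated labels an occurrence suffix so each "Raw Data File" column keeps its own value instead of being overwritten.

```python
from collections import defaultdict

# Hypothetical sketch of the fix idea, not the actual #510 code:
# suffix repeated labels so each column is tracked separately.
def disambiguate(labels):
    seen = defaultdict(int)
    out = []
    for label in labels:
        seen[label] += 1
        out.append(label if seen[label] == 1 else f"{label}.{seen[label] - 1}")
    return out

print(disambiguate(["Raw Data File", "Raw Data File"]))
# ['Raw Data File', 'Raw Data File.1']
```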

proccaserra commented 7 months ago

@ptth222 Thank you for the PR. However, it would only really work if the ISA-Tab reader and the specification allowed it.

The following would be the expected way of representing more than one output of a 'data acquisition' event.

| Assay Name | Raw Data File | Protocol REF | Data Transformation Name | Derived Data File |
| --- | --- | --- | --- | --- |
| A1 | fwd_read.fastq.gz | normalization | DT1 | deseq.tsv |
| A1 | rev_read.fastq.gz | normalization | DT1 | deseq.tsv |
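
The row-per-file layout above can be sketched in plain Python (values taken from the example; illustrative only): the shared downstream cells are repeated on each row rather than adding a second column of the same type.

```python
# Sketch of the expected row-per-file layout: one row per raw file,
# with the shared downstream cells repeated on each row.
raw_files = ["fwd_read.fastq.gz", "rev_read.fastq.gz"]
rows = [
    {"Assay Name": "A1",
     "Raw Data File": raw,
     "Protocol REF": "normalization",
     "Data Transformation Name": "DT1",
     "Derived Data File": "deseq.tsv"}
    for raw in raw_files
]
for row in rows:
    print("\t".join(row.values()))
```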

What the PR does is to generate the following output:

| Assay Name | Raw Data File | Raw Data File | Protocol REF | Data Transformation Name | Derived Data File |
| --- | --- | --- | --- | --- | --- |
| A1 | fwd_read.fastq.gz | rev_read.fastq.gz | normalization | DT1 | deseq.tsv |

This is not allowed and would require changing the ISA-Tab load component.

We now need to check the initial behavior and understand why only the last output file is kept. This will require adding new tests to the test suite and possibly amending the parser.

ptth222 commented 7 months ago

I made new commits to #510 to address what you said. I hope it is better.

I also discovered another issue while making these changes.

There are some inconsistencies between the validation code and the ProcessSequenceFactory that does the parsing. There is a defaults.py file in the isatab module with a list of acceptable column headers, and these are imported for use in the ProcessSequenceFactory, but not in the validation. The validation often uses its own sets of column headers for each rule instead of pulling from defaults or some other unified source. I discovered this because the column name "Derived Data File" was causing a validation error that would not let the conversion continue. This was in the load_table_checks function in the rules_40xx.py file, and I added "Derived Data File" to the list in that function. It might be worthwhile to unify the code so it pulls column headers from one place.
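
The unification suggested above could look something like this (module and function names here are made up for illustration, not isatools' actual API): both the parser and each validation rule consult the same shared header list instead of keeping their own copies.

```python
# Hypothetical sketch of a single source of truth for column headers.
# Names are invented for illustration; isatools' actual defaults.py differs.
DATA_FILE_LABELS = [
    "Raw Data File",
    "Derived Data File",
    "Derived Spectral Data File",
]

def is_data_file_column(header):
    # Both the ProcessSequenceFactory and a rule like load_table_checks
    # would call this instead of hard-coding their own lists, so a header
    # accepted by the parser can never fail validation.
    return header in DATA_FILE_LABELS

print(is_data_file_column("Derived Data File"))  # True
```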