Issue with longest_path logic for determining ISA-Tab columns for output

bobular commented 7 years ago

Hi all, I'm the main developer of the Perl Bio::Parser::ISATab module. For ages it has been read-only, but I'm finally working on write functionality.

It's a real challenge to figure out which columns you need to output from the internal data structure - especially as their order is important, and many columns have the same heading (e.g. Term Source REF). So I had a look at the python API code to see if I could get any inspiration.

I saw the use of longest_path_and_attrs at this line https://github.com/ISA-tools/isa-api/blob/master/isatools/isatab.py#L500 and figured that this would trip up if there's no single unambiguous row of data that contains non-empty data for all the columns we need to output.

So I faked some data in BII-S-1/s_BII-S-1.txt - I each line have EITHER a non-empty value for Characteristics[strain] OR Characteristics[genotype] but never values for both. Then I loaded and dumped the ISA-Tab with this code:

from isatools import isatab
with open('./ISAdatasets-master/tab/BII-I-1/i_investigation.txt', 'r', encoding='utf-8') as fp:
  ISA=isatab.load(fp)
  isatab.dump(isa_obj=ISA, output_path='./output', i_file_name='i_investigation.txt')

The output isn't quite what I expected (both columns are there twice!) but it seems to have found a potential issue. Some of the column names in the output are suffixed by 0 or 1 as shown below:

Source Name Characteristics[organism]0 Term Source REF0 Characteristics[strain]0 Characteristics[genotype]0 Characteristics[organism]1 Term Source REF1 Characteristics[strain]1 Characteristics[genotype]1 Protocol REF Sample Name Factor Value[limiting nutrient] Factor Value[rate] Unit

I'd appreciate any of your thoughts on this issue!

(Edited to fix code snippet markup)

djcomlab commented 7 years ago

Hi @bobular, thanks or your input - the 0s and 1s thing shouldn't really happen and it's probably because the test case you have tried isn't covered in our test suite.

I had made an assumption (maybe not the best assumption) that the longest path with nodes with the most attributes populated with values would, in most cases, be able to generate the full set of ISA-Tab table headers needed for the rest of the paths. At least with our test data this has worked so far. This assumption won't work however with the OR case you've described above.

I think I'll have to revisit this and work out how to dynamically insert new columns if they are needed as new Characteristics columns are indicated as we iterate over the list of all paths. In Python, pandas DataFrames should let me do this without too much difficulty hopefully, but I'm not sure if there's a Perl equivalent (or if you're using something like DataFrames at all).

For the Perl ISATab module, perhaps this might help: the order of the material and process nodes (so things like Sample Name, Source Name, Extract Name, but also Protocol REFs) are important and essentially act as the object identifiers, but actually according to the specification the stuff in between (Characteristics, Factor Values, Parameter Values, Comments etc.) you can think of as object attributes attached to whatever node is indicated by the node column leftwards, and don't necessarily need to be ordered. Term Source REF and Term Accession Number are however importantly associated with the leftwards indicated attribute column to indicate an Ontology annotation value type. Hope that kind of makes sense.

We've got our ISA formats specs now on ReadTheDocs also to refer to: https://isa-specs.readthedocs.io/en/latest/

Hope that helps a bit! Is the Perl parser you're working on on Github also?

bobular commented 7 years ago

Hi @djcomlab Yes you make perfect sense, and thanks for the quick reply!

Firstly yes - it's here: https://github.com/bobular/Bio-Parser-ISATab

Secondly, I agree that it doesn't matter which order the Characteristics for a particular Material node are output in, but yes of course the Term Source REF and Accession Numbers must follow their respective Characteristics or Parameter Value column (not forgetting Units!).

Although your test data doesn't have these cases in it, I can see our real world ISA-Tabs having this kind of either/or column usage, so I will see if I can come up with an algorithm and share it back to you guys. I've already tried aligning the various alternative column headings to create a consensus with my own bespoke algorithm and also a Needleman&Wunsch approach and was not happy. I'll have to build the columns for each Material/Process node separately as you suggest, I think.

Very nice job with the new ISA-Tab/JSON specs by the way! I saw them for the first time last week.

bobular commented 7 years ago

Looks like I can build a table, inserting columns as required, with this Perl module: http://search.cpan.org/~ezdb/Data-Table-1.68/Table.pm :-)

djcomlab commented 7 years ago

Great! So one thing to look out for is that some attribute columns might appear repeated - e.g. you could get the same Characteristic[something] but being attached to different nodes, and also Term Source REF and Term Accession Number the same might happen. This could make it tricky to look up the right column to place values in when you write out the values from the paths.

So what I do is build the column headers to make sure they labels are unique by doing something like Sample Name.Characteristic[Something] and Sample Name.Characteristic[Something].Term Source REF etc., so that when placing the path values in rows it's easily looked up, then before the final write out to disk I normalize the header to valid ISA-Tab headings, so e.g. Sample Name.Characteristic[Something] becomes Characteristic[Something] and Sample Name.Characteristic[Something].Term Source REF becomes Term Source REF and so on.

bobular commented 7 years ago

Awesome, and thanks, that sounds like a good strategy.

bobular commented 7 years ago

Yeah..! So I have made good progress this afternoon. I think I have nailed the s_samples.txt sheet output using your "temporary verbose column names" approach and the well-provisioned Data::Table library. The same code should also work for assay sheets, but I will address that tomorrow.

https://github.com/bobular/Bio-Parser-ISATab/commit/6378c75a52cae9bfd4b850e872fc962b04639675

djcomlab commented 7 years ago

Awesome, thanks - I'll take a look!

ISA-tools / isa-api

Issue with longest_path logic for determining ISA-Tab columns for output #207