ISA-tools / isa-api

ISA tools API
https://isa-tools.org
Other
40 stars 37 forks source link

Protocol types refactor #556

Closed ptth222 closed 2 months ago

ptth222 commented 3 months ago

When I was looking into what we discussed about #553 I found what @proccaserra said about certain columns only being added if protocol types are known. I think I have improved how it is done a little, but maybe not. I figured just doing the PR would be the easiest way to illustrate things.

The first major change is to make it where the data in the protocol-types.yml file is used more directly in the write_assay_table_files function. Previously, it was passed to a single use function, get_column_header, which repeated a lot of information that was already in the protocol_types_dict itself. Now, that function is no longer needed. This ended up causing an issue in testing though so I modified the protocol-types.yml file a bit.

Some of the protocol types in the file have headers, such as "Extract Name", which are already added to the assay file in another way in the write_assay_table_files function. This caused those headers to be repeated. I ended up simply deleting those header values and adding a check to look for null header values in the write_assay_table_files function to fix this. The only other place the protocol_types_dict is used is in the model protocol.py file to print out allowed protocol types. I'm not sure how important the header attributes are for that, so this might not be an acceptable change.

I also noticed in the process.py file in the model in the from_assay_dict method that there was a hard coded allowed_protocol_type_terms list. Initially, I thought this could also be changed to use the protocol_types_dict, but I don't think the logic using that hard coded list was actually doing anything anyway so I just removed it. It was only used to set the "name" attribute of the process, but there was already a line above it, self.name = process.get('name', ''), that would set the name regardless of protocol type.

Let me know what you guys think.

There is one more bit in the write_assay_table_files function to do with this that could possibly be improved as well, but I think it would require a more significant change. The protocol type, "nucleic acid hybridization", has special code to add 2 columns instead of 1, and I think that special code could be removed if we were willing to change the protocol_types_dict some more. If the "header" attribute were a list instead of a string then we could add the list, but then we also need a mapping from the headers to the node attribute to fill in the data for that column. Currently, they all are assumed to map to node.name with "nucleic acid hybridization" having the additional hard coded node.array_design_ref. If "header" was a dictionary that mapped the header to the attribute then the special coding for "nucleic acid hybridization" could be removed. I didn't just go on ahead and do that in this PR because I'm not too sure how important printing the allowed protocol types is. The "header" attribute already doesn't seem necessary for that purpose, so I'm not sure why it's there.

coveralls commented 3 months ago

Coverage Status

coverage: 81.214% (+0.002%) from 81.212% when pulling 945241acc1def22043b1bfba33da2599688ccbea on ptth222:protocol-types-refactor into d7aa0277b8a6a0848094cc8adbc0d5f512899bf5 on ISA-tools:issue-511.