MoseleyBioinformaticsLab / MESSES

MESSES (Metadata from Experimental SpreadSheets Extraction System) is a Python package that facilitates the conversion of tabular data into other formats.
https://moseleybioinformaticslab.github.io/MESSES/

Automation headers can't have spaces #3

Closed ptth222 closed 2 years ago

ptth222 commented 2 years ago

In the automation tags, trying to match a header that contains spaces does not work.

Example:

tags #header #tag.add

          Parent Subject ID    #sample.id

This will not match a "Parent Subject ID" header in the data. It ends up looking for three separate headers instead: "Parent", "Subject", and "ID".

The issue is in cythonized_tagSheet.pyx. Lines 75 and 15 are of interest:

    headerSplitter = re.compile(r'[+]|(r?\"[^\"]*\"|r?\'[^\']*\')|\s+')
    [ token for token in re.split(headerSplitter, headerTagDescription["header"]) if token != "" and token != None ]

Splitting on spaces is what causes the issue. There is a fairly simple workaround: wrap the header in a regex (r'...').

Example:

tags #header #tag.add

          r'Parent Subject ID'    #sample.id

This will now match.
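
To make the two cases above concrete, here is a minimal sketch of the current tokenization. It assumes the splitter's character classes carry a `*` quantifier (the asterisks appear to have been lost when the issue text was rendered), and `tokenize` is a hypothetical helper name, not a function in MESSES.

```python
import re

# Splitter as quoted in this issue; the `*` quantifiers are an assumption.
header_splitter = re.compile(r'[+]|(r?\"[^\"]*\"|r?\'[^\']*\')|\s+')

def tokenize(header):
    """Split a header description, dropping empty strings and None entries."""
    # re.split() yields None wherever the capture group did not take part
    # in a match (for example when the delimiter was "+" or whitespace).
    return [token for token in re.split(header_splitter, header)
            if token != "" and token is not None]

plain_tokens = tokenize("Parent Subject ID")
wrapped_tokens = tokenize("r'Parent Subject ID'")

print(plain_tokens)    # ['Parent', 'Subject', 'ID']
print(wrapped_tokens)  # ["r'Parent Subject ID'"]
```

The wrapped form survives because the whole quoted span is consumed by the capture group before the whitespace branch gets a chance to split it.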

I have made a note of this in the documentation, but the question is whether we should do some work to try and make this match without needing the regex. I feel like users are going to expect this to work without needing a regex and get confused and frustrated like I did when it doesn't.

I think something like:

    headerSplitter = re.compile(r'[+]|(r?\"[^\"]*\"|r?\'[^\']*\')')
    [ token.strip() for token in re.split(headerSplitter, headerTagDescription["header"]) if token != "" and token != None ]

will fix the issue. I just can't be sure it doesn't break some other header formulations. I think the space in the splitter is just to let people put spaces between the concatenation operators though, and the strip() should still allow that.

hunter-moseley commented 2 years ago

This could add empty tokens into the list comprehension. But I am not sure if this is a bad thing or not. Need to see the actual code.
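
A small sketch of how empty tokens could arise with the first proposal: the filter runs before strip(), so a piece consisting only of spaces passes the filter and is then stripped down to an empty string. The `*` quantifiers and the helper name `tokenize_first_proposal` are assumptions made for illustration.

```python
import re

# First proposed splitter: the same pattern without the whitespace branch.
header_splitter = re.compile(r'[+]|(r?\"[^\"]*\"|r?\'[^\']*\')')

def tokenize_first_proposal(header):
    # The emptiness check sees the unstripped token, so " " is kept
    # and only afterwards reduced to "" by strip().
    return [token.strip() for token in re.split(header_splitter, header)
            if token != "" and token is not None]

spaced = tokenize_first_proposal(' Compound + "-13C" + C_isomers ')
blank = tokenize_first_proposal(" ")

print(spaced)  # ['Compound', '', '"-13C"', '', 'C_isomers']
print(blank)   # ['']
```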

ptth222 commented 2 years ago

I thought the ' if token != "" ' would prevent that. I have revised.

Old:

    headerSplitter = re.compile(r'[+]|(r?\"[^\"]*\"|r?\'[^\']*\')|\s+')
    [ token for token in re.split(headerSplitter, headerTagDescription["header"]) if token != "" and token != None ]

New:

    headerSplitter = re.compile(r'[+]|(r?\"[^\"]*\"|r?\'[^\']*\')')
    [ strippedToken for token in re.split(headerSplitter, headerTagDescription["header"]) if token != None and (strippedToken := token.strip()) != ""]

Tests:

    header = "Project Subject ID"
    header = 'Compound+"-13C"+C_isomers+"-"+SamplID'
    header = ' Compound + "-13C" + C_isomers + "-" + SamplID '
    header = " "

Old Results:

    ['Project', 'Subject', 'ID']
    ['Compound', '"-13C"', 'C_isomers', '"-"', 'SamplID']
    ['Compound', '"-13C"', 'C_isomers', '"-"', 'SamplID']
    []

New Results:

    ['Project Subject ID']
    ['Compound', '"-13C"', 'C_isomers', '"-"', 'SamplID']
    ['Compound', '"-13C"', 'C_isomers', '"-"', 'SamplID']
    []

They are different only for the one where it needs to be. Are there any other cases I am not thinking of?
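
For anyone who wants to rerun the comparison, the four tests could be wrapped in a small standalone harness along these lines. This is only a sketch: `old_tokens`, `new_tokens`, and `CASES` are hypothetical names, and the `*` quantifiers in both patterns are assumed.

```python
import re

# Both splitters as quoted in this thread (the `*` quantifiers are assumed).
OLD_SPLITTER = re.compile(r'[+]|(r?\"[^\"]*\"|r?\'[^\']*\')|\s+')
NEW_SPLITTER = re.compile(r'[+]|(r?\"[^\"]*\"|r?\'[^\']*\')')

def old_tokens(header):
    # Whitespace is a delimiter, so multi-word headers are broken apart.
    return [token for token in re.split(OLD_SPLITTER, header)
            if token != "" and token is not None]

def new_tokens(header):
    # Whitespace is stripped from each piece; pieces that end up empty are dropped.
    return [stripped for token in re.split(NEW_SPLITTER, header)
            if token is not None and (stripped := token.strip()) != ""]

CONCAT = ['Compound', '"-13C"', 'C_isomers', '"-"', 'SamplID']

# (header, expected old result, expected new result), copied from the lists above.
CASES = [
    ("Project Subject ID", ['Project', 'Subject', 'ID'], ['Project Subject ID']),
    ('Compound+"-13C"+C_isomers+"-"+SamplID', CONCAT, CONCAT),
    (' Compound + "-13C" + C_isomers + "-" + SamplID ', CONCAT, CONCAT),
    (" ", [], []),
]

for header, expected_old, expected_new in CASES:
    assert old_tokens(header) == expected_old
    assert new_tokens(header) == expected_new
    print(repr(header), "->", new_tokens(header))
```

Further edge cases (tabs, headers containing quotes, adjacent "+" signs) could be appended to `CASES` in the same form.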

hunter-moseley commented 2 years ago

Nice use of a walrus assignment operator!

I now roughly remember pondering how to implement this parsing and trying a couple of approaches. I compromised on the implementation because I did not want to use two comprehensions or call .strip() twice.

Your implementation with the walrus assignment operator is very clean.
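
For comparison, the three alternatives mentioned here can be sketched side by side. The variable names are hypothetical and the `*` quantifiers in the pattern are assumed; the assignment expression (`:=`) requires Python 3.8 or later.

```python
import re

splitter = re.compile(r'[+]|(r?\"[^\"]*\"|r?\'[^\']*\')')
pieces = re.split(splitter, ' Compound + "-13C" + SamplID ')

# Option 1: two stages, strip in an inner generator and filter outside it.
two_stage = [t for t in (p.strip() for p in pieces if p is not None) if t != ""]

# Option 2: one comprehension, but strip() is called twice per kept token.
strip_twice = [p.strip() for p in pieces if p is not None and p.strip() != ""]

# Option 3: one comprehension, one strip() call, using an assignment expression.
walrus = [s for p in pieces if p is not None and (s := p.strip()) != ""]

assert two_stage == strip_twice == walrus
print(walrus)  # ['Compound', '"-13C"', 'SamplID']
```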

ptth222 commented 2 years ago

Implemented and tested in commit 43ac751b7834807d4a2665e73dc566d9a89f1966.