Closed: ptth222 closed this issue 2 years ago
This could add empty tokens into the list comprehension. But I am not sure if this is a bad thing or not. Need to see the actual code.
I thought the ' if token != "" ' would prevent that. I have revised.
Old:
    headerSplitter = re.compile(r'[+]|(r?\"[^\"]\"|r?\'[^\']\')|\s+')
    [ token for token in re.split(headerSplitter, headerTagDescription["header"]) if token != "" and token != None ]

New:
    headerSplitter = re.compile(r'[+]|(r?\"[^\"]\"|r?\'[^\']\')')
    [ strippedToken for token in re.split(headerSplitter, headerTagDescription["header"]) if token != None and (strippedToken := token.strip()) != ""]

Tests:
    header = "Project Subject ID"
    header = 'Compound+"-13C"+C_isomers+"-"+SamplID'
    header = ' Compound + "-13C" + C_isomers + "-" + SamplID '
    header = " "

Old Results:
    ['Project', 'Subject', 'ID']
    ['Compound', '"-13C"', 'C_isomers', '"-"', 'SamplID']
    ['Compound', '"-13C"', 'C_isomers', '"-"', 'SamplID']
    []

New Results:
    ['Project Subject ID']
    ['Compound', '"-13C"', 'C_isomers', '"-"', 'SamplID']
    ['Compound', '"-13C"', 'C_isomers', '"-"', 'SamplID']
    []
The results differ only for the one header where they need to. Are there any other cases I am not thinking of?
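For anyone who wants to rerun the comparison, here is a minimal sketch. The two patterns are copied as they appear in this thread; the function names `old_tokens` and `new_tokens` are hypothetical wrappers added for illustration and are not names from cythonized_tagSheet.pyx. The walrus operator needs Python 3.8 or later.

```python
import re

# Patterns copied as they appear in this thread.
OLD_SPLITTER = re.compile(r'[+]|(r?\"[^\"]\"|r?\'[^\']\')|\s+')
NEW_SPLITTER = re.compile(r'[+]|(r?\"[^\"]\"|r?\'[^\']\')')


def old_tokens(header):
    """Old behaviour: whitespace is itself a separator (hypothetical wrapper)."""
    return [token for token in re.split(OLD_SPLITTER, header)
            if token != "" and token is not None]


def new_tokens(header):
    """Revised behaviour: split on '+' and quoted literals, then strip each piece."""
    return [stripped for token in re.split(NEW_SPLITTER, header)
            if token is not None and (stripped := token.strip()) != ""]


HEADERS = [
    "Project Subject ID",
    'Compound+"-13C"+C_isomers+"-"+SamplID',
    ' Compound + "-13C" + C_isomers + "-" + SamplID ',
    " ",
]

for header in HEADERS:
    print(repr(header))
    print("  old:", old_tokens(header))
    print("  new:", new_tokens(header))
```

Only the first header should print different token lists for the old and new splitters.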
Nice use of a walrus assignment operator!
I now roughly remember pondering how to implement this parsing and trying a couple of approaches. I compromised on the implementation because I did not want to use two comprehensions or call .strip() twice.
Your implementation with the walrus assignment operator is very clean.
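To make the trade-off concrete, here is a small sketch of the three options mentioned: two comprehensions, calling .strip() twice, and the walrus operator. The `tokens` list is made-up sample data that imitates what re.split can return (including None and whitespace-only pieces); it is not taken from the project.

```python
# Made-up sample of re.split output, including None and blank pieces.
tokens = [" Compound ", None, " ", '"-"', "", " SamplID "]

# Option 1: two comprehensions (strip first, then filter).
stripped_all = [t.strip() for t in tokens if t is not None]
two_passes = [t for t in stripped_all if t != ""]

# Option 2: one comprehension, but .strip() runs twice per kept token.
strip_twice = [t.strip() for t in tokens
               if t is not None and t.strip() != ""]

# Option 3: one comprehension, one .strip() call, via the walrus operator
# (assignment expression, Python 3.8+).
walrus = [s for t in tokens
          if t is not None and (s := t.strip()) != ""]

assert two_passes == strip_twice == walrus
print(walrus)  # ['Compound', '"-"', 'SamplID']
```

All three give the same list; the walrus form just avoids the extra pass and the repeated call.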
(In reply to ptth222's comment above, sent Wed, Oct 5, 2022 at 2:54 AM.)
Implemented and tested. 43ac751b7834807d4a2665e73dc566d9a89f1966
In the automation tags, matching a header that contains spaces does not work.
Example:
tags #header #tag.add
This will not match a "Parent Subject ID" header in the data. It ends up looking for "Parent", "Subject", and "ID" headers instead.
The issue is in cythonized_tagSheet.pyx; lines 75 and 15 are of interest:
    headerSplitter = re.compile(r'[+]|(r?\"[^\"]\"|r?\'[^\']\')|\s+')
    [ token for token in re.split(headerSplitter, headerTagDescription["header"]) if token != "" and token != None ]
Choosing to split on spaces is what is causing the issue. There is a fairly simple workaround of wrapping the header in a regex.
Example:
tags #header #tag.add
This will now match.
I have made a note of this in the documentation, but the question is whether we should do some work to try and make this match without needing the regex. I feel like users are going to expect this to work without needing a regex and get confused and frustrated like I did when it doesn't.
I think something like:
    headerSplitter = re.compile(r'[+]|(r?\"[^\"]\"|r?\'[^\']\')')
    [ token.strip() for token in re.split(headerSplitter, headerTagDescription["header"]) if token != "" and token != None ]
will fix the issue. I just can't be sure it doesn't break some other header formulations. I think the whitespace alternative in the splitter is only there to let people put spaces around the concatenation operators, though, and the strip() should still allow that.
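One way to check whether this breaks other formulations is to run it on a header with spaces around the concatenation operators. The sketch below copies the pattern as it appears here; `proposed_tokens` is a hypothetical wrapper added for illustration. It suggests that whitespace-only fragments next to a quoted literal pass the `!= ""` test and only become empty after strip(), so the emptiness check probably needs to run on the stripped token.

```python
import re

# Pattern copied as it appears in this issue (no whitespace alternative).
HEADER_SPLITTER = re.compile(r'[+]|(r?\"[^\"]\"|r?\'[^\']\')')


def proposed_tokens(header):
    """Strip each token, but filter on the unstripped value (hypothetical wrapper)."""
    return [token.strip() for token in re.split(HEADER_SPLITTER, header)
            if token != "" and token is not None]


# A header with spaces and no operators is kept whole, as intended.
print(proposed_tokens("Parent Subject ID"))         # ['Parent Subject ID']

# The single spaces around the quoted literal survive the filter,
# then strip() turns them into empty strings.
print(proposed_tokens('Compound + "-" + SamplID'))  # ['Compound', '', '"-"', '', 'SamplID']

# A blank header yields one empty token instead of an empty list.
print(proposed_tokens(" "))                         # ['']
```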