MoseleyBioinformaticsLab / MESSES

MESSES (Metadata from Experimental SpreadSheets Extraction System) is a Python package that facilitates the conversion of tabular data into other formats.
https://moseleybioinformaticslab.github.io/MESSES/

Newline Mismatching #11

Closed by ptth222 1 year ago

ptth222 commented 1 year ago

Because newline characters vary (\r\n vs. \n vs. \r), unit tests were sometimes breaking: pandas sanitizes newlines unpredictably. By default, Python normalizes everything to \n on text-mode reads, but pandas is not consistent; some versions normalize and some don't.
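The Python side of this can be checked with a minimal sketch using only the standard library. It writes mixed line endings to a temporary file and reads them back twice: once with the default text-mode settings, and once with `newline=""`, which turns newline translation off. The helper name `read_both_ways` is made up for illustration and is not part of MESSES.

```python
import os
import tempfile


def read_both_ways(raw: bytes):
    """Write raw bytes to a temp file, then read them back two ways."""
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(raw)
        path = handle.name
    try:
        # Default text mode (newline=None): \r\n and lone \r are translated to \n.
        with open(path, "r", encoding="utf-8") as f:
            normalized = f.read()
        # newline="" disables translation, so the original endings survive.
        with open(path, "r", encoding="utf-8", newline="") as f:
            preserved = f.read()
    finally:
        os.remove(path)
    return normalized, preserved


normalized, preserved = read_both_ways(b"a\r\nb\rc\n")
print(repr(normalized))  # 'a\nb\nc\n'
print(repr(preserved))   # 'a\r\nb\rc\n'
```

A reader that opens files with translation disabled, or that parses an Excel workbook rather than plain text, can therefore hand back \r characters that a plain `open()` would never show.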

As suggested by Hunter, I have added a command line option that lets users remove characters using a string or regex. The default removes both escaped unicode characters and \r, because if a \r makes it into an Excel file, pandas will read it in as x000D. The specific regex is "x([0-9a-fA-F]{4})|\r". The option is "--file-processing". It is implemented in TagParser's loadSheet method. The specific changes can be seen here: https://github.com/MoseleyBioinformaticsLab/MESSES/blob/Travis/src/messes/extract.py. I added new tests and everything is passing.
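As a rough illustration of what the default does, here is a sketch that applies the regex to every string cell in a small table. The pattern is copied as quoted above; the function name `clean_cell` and the sample rows are hypothetical, and the actual implementation in extract.py may differ.

```python
import re

# Pattern copied as quoted in this thread; the real default lives in extract.py.
DEFAULT_PATTERN = re.compile(r"x([0-9a-fA-F]{4})|\r")


def clean_cell(value, pattern=DEFAULT_PATTERN):
    """Remove every match of pattern from a string; pass other types through."""
    if not isinstance(value, str):
        return value
    return pattern.sub("", value)


# Hypothetical rows standing in for a loaded sheet.
rows = [
    ["name", "note"],
    ["sample1", "first linex000D\nsecond line\r\n"],
    ["sample2", 3.5],
]
cleaned = [[clean_cell(cell) for cell in row] for row in rows]
print(cleaned[1][1] == "first line\nsecond line\n")  # True
```

Two caveats. Excel's escape for a carriage return is usually written `_x000D_`, with underscores that the page rendering here may have dropped, so the committed pattern may not match the quoted one exactly. And as quoted, the pattern also matches ordinary text where an "x" is followed by four hex digits (for example, "boxcafe" becomes "bo").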

What do you think, Hunter?

hunter-moseley commented 1 year ago

This is a solid solution! It future-proofs against changes in pandas.

It would also handle other special character issues.

(Sent by email in reply to ptth222's comment of Mon, Nov 14, 2022, 3:03 PM.)
