Closed ptth222 closed 1 year ago
This is a solid solution! It future-proofs us against changes in pandas.
It would also handle other special-character issues.
On Mon, Nov 14, 2022 at 3:03 PM ptth222 wrote:
Because newline conventions vary (\r\n, \n, or \r), unit tests were sometimes breaking: pandas sanitizes newlines unpredictably. By default Python normalizes all of them to \n when reading files in text mode, but pandas is not consistent; some versions normalize and some do not.
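For reference, here is a minimal sketch of the Python side of that behavior, using only the standard library. The file name and sample bytes are made up for illustration; it says nothing about what any particular pandas version does.

```python
import os
import tempfile

# Sample bytes mixing all three newline conventions (illustrative only).
raw = b"a\r\nb\rc\nd"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "newlines.txt")
    with open(path, "wb") as handle:
        handle.write(raw)

    # Default text mode (newline=None): \r\n and \r are translated to \n on read.
    with open(path, "r", encoding="utf-8") as handle:
        normalized = handle.read()

    # newline="" turns the translation off, so the original endings survive.
    with open(path, "r", encoding="utf-8", newline="") as handle:
        untouched = handle.read()

print(repr(normalized))  # 'a\nb\nc\nd'
print(repr(untouched))   # 'a\r\nb\rc\nd'
```

A test that compares strings read with the default mode against strings that came through another reader can therefore disagree only on the line endings.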
As suggested by Hunter, I have added a command line option that lets users remove characters using a string or a regex. The default removes both unicode characters and \r, because if a \r makes it into an Excel file pandas will read it in as x000D. The specific regex is "x([0-9a-fA-F]{4})|\r". The option is "--file-processing". It is implemented in TagParser's loadSheet method. The specific changes can be seen here: https://github.com/MoseleyBioinformaticsLab/MESSES/blob/Travis/src/messes/extract.py. I added new tests and everything is passing.
What do you think, Hunter?
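To make the idea concrete, here is a rough sketch of string-or-regex removal applied to already-loaded cell values. The pattern is copied verbatim from the post above; the helper name `remove_matches`, its `is_regex` parameter, and the sample rows are hypothetical and are not taken from TagParser's loadSheet or from the actual "--file-processing" implementation.

```python
import re

# Pattern copied verbatim from the post; how MESSES applies it is not shown here.
DEFAULT_PATTERN = r"x([0-9a-fA-F]{4})|\r"


def remove_matches(rows, pattern=DEFAULT_PATTERN, is_regex=True):
    """Hypothetical helper: delete every match of `pattern` from each string cell.

    When `is_regex` is False the pattern is treated as a literal string.
    Non-string cells are passed through unchanged.
    """
    if not is_regex:
        pattern = re.escape(pattern)
    compiled = re.compile(pattern)
    return [
        [compiled.sub("", cell) if isinstance(cell, str) else cell for cell in row]
        for row in rows
    ]


# Made-up sheet contents: one cell with \r\n, one with the x000D residue.
rows = [
    ["id", "note"],
    ["s1", "first line\r\nsecond line"],
    ["s2", "valuex000D"],
    ["s3", 42],
]

cleaned = remove_matches(rows)
print(cleaned[1][1] == "first line\nsecond line")  # True
print(cleaned[2][1])                               # value

# Literal-string mode: the dot is removed as a plain character, not a wildcard.
literal = remove_matches([["a.b"]], pattern=".", is_regex=False)
print(literal)  # [['ab']]
```

Note that the pattern as written matches an "x" followed by any four hex digits anywhere in a cell, so ordinary text of that shape would be removed too.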