CDPHE-bioinformatics / CDPHE-SARS-CoV-2

Workflows and scripts for the assembly and analysis of SARS-CoV-2 whole genome tiled amplicon sequencing.
https://cdphe-bioinformatics.github.io/CDPHE-SARS-CoV-2/
GNU General Public License v3.0

Invalid escape sequence in newer versions of python #31

Open danpolanco opened 5 months ago

danpolanco commented 5 months ago

Describe the bug

Either the handling of the regex passed to `findall` changed in newer versions of Python, or the pattern has always been incorrect:

https://github.com/CDPHE-bioinformatics/CDPHE-SARS-CoV-2/blob/f3b93dd3972b1378329810fa4a81d87a26afcdfa/scripts/concat_seq_metrics_and_lineage_results.py#L123

To Reproduce

See image in screenshots section but briefly:

  1. In Python 3.10, run the selected code with an example input like `fasta_header = CDPHE-CO-123456789-0`.
  2. In Python 3.12, repeat the same step.
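The comparison in the steps above can also be made from a single script. This is a minimal sketch, not the repository's code: it assumes the affected line uses the same pattern as the proposed fix but without the `r` prefix, and `escape_warnings` is a hypothetical helper. It compiles a source line and records which warning category the interpreter raises for the string literal.

```python
import warnings

# ASSUMPTION: the affected line holds this pattern in a normal (non-raw)
# string, so "\-" and "\." are treated as string escape sequences.
plain_source = r"pattern = 'CO-CDPHE-([0-9a-zA-Z_\-\.]+)'"

# The same statement with the raw-string prefix added.
raw_source = r"pattern = r'CO-CDPHE-([0-9a-zA-Z_\-\.]+)'"


def escape_warnings(src):
    """Compile src and return the names of any warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(src, "<example>", "exec")
    return [w.category.__name__ for w in caught]


print(escape_warnings(plain_source))
print(escape_warnings(raw_source))
```

Python 3.6 through 3.11 report the invalid escape as a `DeprecationWarning`, which is hidden by default in most contexts; Python 3.12 reports it as a visible `SyntaxWarning`. The raw-string version should produce no warning on either.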

Expected behavior

The Python interpreter should not issue any errors.

Screenshots

image

Additional context

N/A

danpolanco commented 5 months ago

I believe the correct way to do this in newer versions of Python is with raw strings:

re.findall(r'CO-CDPHE-([0-9a-zA-Z_\-\.]+)', fasta_header)

The change is minor as a raw string is just denoted by adding an r to the front of a string (i.e. r"string"). I'm not sure this is the correct change and welcome discussion / more research.
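As a quick check that the raw string leaves the regex itself unchanged, here is a small sketch. The header value `CO-CDPHE-123456789-0` is a hypothetical input chosen to match the pattern's `CO-CDPHE-` prefix; it is not taken from the repository's data.

```python
import re

# Raw-string form proposed above: backslashes reach the regex engine as-is.
raw_pattern = r'CO-CDPHE-([0-9a-zA-Z_\-\.]+)'

# The same pattern in a normal string, with each backslash doubled.
escaped_pattern = 'CO-CDPHE-([0-9a-zA-Z_\\-\\.]+)'

# Both spellings produce the identical pattern text.
print(raw_pattern == escaped_pattern)  # True

# Hypothetical header that starts with the prefix the pattern expects.
fasta_header = "CO-CDPHE-123456789-0"
print(re.findall(raw_pattern, fasta_header))  # ['123456789-0']

# The example header from the reproduction steps, as written there,
# starts with "CDPHE-CO-" and so does not match this pattern.
print(re.findall(raw_pattern, "CDPHE-CO-123456789-0"))  # []
```

Because the two spellings are the same string, switching to the raw form should only silence the interpreter's complaint without changing what the regex matches.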

image

Also see "The Backslash Plague" section of Python's Regular Expression HOWTO.