broadinstitute / long-read-pipelines

Long read production pipelines
https://broadinstitute.github.io/long-read-pipelines/
BSD 3-Clause "New" or "Revised" License
140 stars 25 forks source link

PBUtils.GetRunInfo needs to sanitize its inputs. #163

Closed jonn-smith closed 4 years ago

jonn-smith commented 4 years ago

I ran into an issue where PBUtils.GetRunInfo is producing some garbage values in the fields it creates based on the input here: gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/

I created that file as a small test set before running the whole dataset again. My processing of the data may have something to do with why the info is getting mangled.

run_info.txt:

ID  44eaa76d
CN  BI
DT  2020-04-30T18:03:00.352Z -t 4 /cromwell_root/broad-dsde-methods-long-reads/covid-19-aziz/Homo_sapiens_assembly38_SARS_CoV-2.fasta -
PL  PACBIO
PM  SEQUELII
PU  m64020_200501_004434
SM  covid_lr_pilot_1\
DS  READTYPE=CCS;BINDINGKIT=101-789-500;SEQUENCINGKIT=101-826-100;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000
SO  unknown
DA  gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/m64020_200501_004434.subset_5000.subreads.bam,gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/workflow_results/covid_lr_pilot_1.m64020_200501_004434/20200811_160817_710874908/aligned_array_elements/covid_lr_pilot_1.m64020_200501_004434.bam,gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/workflow_results/covid_lr_pilot_1.m64020_200501_004434/20200811_160817_710874908/annotated_aligned_array_elements/covid_lr_pilot_1.m64020_200501_004434.bam,gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/workflow_results/covid_lr_pilot_1.m64020_200501_004434/20200811_160817_710874908/merged_bams/aligned/covid_lr_pilot_1.m64020_200501_004434.bam,gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/workflow_results/covid_lr_pilot_1.m64020_200501_004434/20200811_160817_710874908/merged_bams/unaligned/covid_lr_pilot_1.m64020_200501_004434.bam,gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/workflow_results/covid_lr_pilot_1.m64020_200501_004434/20200811_192650_713044861/aligned_array_elements/covid_lr_pilot_1.m64020_200501_004434.bam,gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/workflow_results/covid_lr_pilot_1.m64020_200501_004434/20200811_192650_713044861/annotated_aligned_array_elements/covid_lr_pilot_1.m64020_200501_004434.bam,gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/workflow_results/covid_lr_pilot_1.m64020_200501_004434/20200811_192650_713044861/merged_bams/aligned/covid_lr_pilot_1.m64020_200501_004434.bam,gs://broad-dsde-methods-long-reads/covid-19-aziz/small_test_set/workflow_results/covid_lr_pilot_1.m64020_200501_004434/20200811_192650_713044861/merged_bams/unaligned/covid_lr_pilot_1.m64020_200501_004434.bam
PN  minimap2
VN  2.17-r941
CL  minimap2 -ayYL --MD --eqx -x splice -R @RG
CM  S/P4-C2/5.0-8M
TY  BAM

Note the values of the DT and SM fields.

DT The DT field should not have the -t 4 /cromwell_root/broad-dsde-methods-long-reads/covid-19-aziz/Homo_sapiens_assembly38_SARS_CoV-2.fasta - text in it. Its presence was causing the MiniMap2 alignment tasks to fail in an unexpected way - it was running out of memory because it thought the reference was a FASTA file to align (it was effectively being given that text multiple times as arguments).

SM The SM field should not end in a trailing backslash. This caused MergeBams tasks to fail when they looked for bam indices with the literal \ character in them.

I suspect that all of the parsing this is doing may suffer from similar issues.

kvg commented 4 years ago

Uhoh, that's no good. I'll fix this first thing tomorrow.

jonn-smith commented 4 years ago

Yeah - I made some workarounds in my workflows to get past it. I didn't look at the code for the tool, so I'm not sure what exactly it's grabbing that it's not expecting.

I'm not moving the test files, so they'll be there for you to test with.

kvg commented 4 years ago

Fixed and merged, but see https://github.com/broadinstitute/long-read-pipelines/issues/174 for future considerations on this task.