Incorrect fastq chunk with sample_sheet input

LuJiansen commented 3 months ago

Operating System

CentOS 7

Other Linux

No response

Workflow Version

v1.2.0

Workflow Execution

Command line (Local)

Other workflow execution

No response

EPI2ME Version

No response

CLI command run

nextflow run epi2me-labs/wf-pore-c \
    --fastq 'raw_data' \
    --sample_sheet 'sample_sheet.csv' \
    --cutter 'MboI' \
    --threads 20 \
    --chunk_size 20000 \
    --mcool \
    --coverage \
    --chromunity \
    --pairs \
    --vcf '/path/to/illumina_PlatinumGenomes_2017_hg38_NA12878_PS.vcf.gz' \
    --phased_vcf '/path/to/illumina_PlatinumGenomes_2017_hg38_NA12878_PS.vcf.gz' \
    --ref '/path/to/GRCh38.fa' \
    -profile singularity -resume

Workflow Execution - CLI Execution Profile

None

What happened?

The chunk index passed to bamindex fetch in the 'digest_align_annotate' process is defined as task.index - 1, which works in single-sample mode. With multiple samples, however, the task.index of the process is counted cumulatively across samples.

For example, given 3 samples (barcode01, barcode02 and barcode03) with 6 chunks each, the task.index of the 'digest_align_annotate' process ranges from 1 to 18 when running with a sample sheet. bamindex then tries to fetch chunk indices beyond the 6 chunks available in each BAM, which leads to a segmentation fault (core dump).
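The arithmetic in the example above can be reproduced with a small simulation. This is only a sketch: the function and variable names are hypothetical, and it assumes tasks are numbered in sample order with 0-based chunk indices, which may not match how Nextflow actually schedules them.

```python
# Hypothetical simulation of the indexing problem; none of these names
# come from the workflow itself.
samples = ["barcode01", "barcode02", "barcode03"]
chunks_per_sample = 6  # valid chunk indices per BAM are assumed to be 0..5


def global_task_indices(samples, chunks_per_sample):
    """Yield (sample, requested_chunk) using one task counter shared by all samples."""
    task_index = 0
    for sample in samples:
        for _ in range(chunks_per_sample):
            task_index += 1  # mimics a 1-based, process-wide task.index
            yield sample, task_index - 1  # the 'task.index - 1' rule


requested = list(global_task_indices(samples, chunks_per_sample))

# Requests that point past the last chunk of a sample's BAM.
out_of_range = [(s, c) for s, c in requested if c >= chunks_per_sample]

print("largest requested chunk:", max(c for _, c in requested))
print("out-of-range requests:", len(out_of_range), "of", len(requested))
```

Under these assumptions only the first sample receives valid chunk indices; every request for the second and third samples falls outside the range of its BAM.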

Relevant log output

Command exit status:
  1

Command output:
  6e907578-df25-4728-a3c8-841cfdd1f3e2

Command error:
  6e907578-df25-4728-a3c8-841cfdd1f3e2
  [14:40:33 - Digest    ] Digesting concatemers from [<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>].
  [14:40:33 - AnntateBAM] Processing reads from -
  Traceback (most recent call last):
    File "/opt/custflow/epi2meuser/conda/bin/pore-c-py", line 10, in <module>
      sys.exit(run_main())
    File "/opt/custflow/epi2meuser/conda/lib/python3.8/site-packages/pore_c_py/main.py", line 408, in run_main
      args.func(args)
    File "/opt/custflow/epi2meuser/conda/lib/python3.8/site-packages/pore_c_py/main.py", line 249, in digest_bam
      with pysam.AlignmentFile(
    File "pysam/libcalignmentfile.pyx", line 748, in pysam.libcalignmentfile.AlignmentFile.__cinit__
    File "pysam/libcalignmentfile.pyx", line 953, in pysam.libcalignmentfile.AlignmentFile._open
  ValueError: file does not contain alignment data
  [M::bam2fq_mainloop] discarded 0 singletons
  [M::bam2fq_mainloop] processed 0 reads
  [M::main::8.973*0.93] loaded/built the index for 194 target sequence(s)
  [M::mm_mapopt_update::10.571*0.94] mid_occ = 705
  [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 194
  [M::mm_idx_stat::11.444*0.94] distinct minimizers: 100159079 (38.79% are singletons); average occurrences: 5.540; average spacing: 5.586; total length: 3099750718
  [M::main] Version: 2.28-r1209
  [M::main] CMD: minimap2 -ay -t 13 -x map-ont --cap-kalloc 100m --cap-sw-mem 50m reference.fasta.mmi -
  [M::main] Real time: 11.707 sec; CPU: 11.064 sec; Peak RSS: 7.729 GB
  [14:40:44 - AnntateAln] Found 0 monomers in 0 concatemers.
  [14:40:44 - AnntateBAM] Finished BAM parsing.

Work dir:
  /data/mini_test/work/7b/311fc97ca33eebd0ef2aaacc51e3d0

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

$ cat /data/mini_test/work/7b/311fc97ca33eebd0ef2aaacc51e3d0/.command.sh
#!/bin/bash -euo pipefail
echo "6e907578-df25-4728-a3c8-841cfdd1f3e2"
bamindex fetch --chunk=17 "concatemers.bam" |
    pore-c-py digest "MboI" --max_monomers 250 --excluded_list "filtered_reads.txt"                 --header "concatemers.bam"                 --threads 2 |
samtools fastq --threads 1 -T '*' |
minimap2 -ay -t 13 -x map-ont --cap-kalloc 100m --cap-sw-mem 50m                 "reference.fasta.mmi" - |
pore-c-py annotate - "P94B1" --monomers                 --threads 2  --stdout  --chromunity --chromunity_merge_distance -1 --summary  |             tee "P94B1_out.ns.bam" |
samtools sort -m 1G --threads 2  -u --write-index -o "P94B1.cs.bam" -

Note that concatemers.bam has only 6 chunks, while bamindex tries to fetch chunk 17.

Application activity log entry

No response

Were you able to successfully run the latest version of the workflow with the demo data?

yes

Other demo data information

No response

sarahjeeeze commented 3 months ago

Hi, thanks for pointing this out. We will update this so it uses a separate chunk variable and let you know when that is done.
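For readers following along, one way a separate chunk variable could behave is sketched below. This is an assumption for illustration only, not the maintainers' actual fix: the helper name and the chunk counts are hypothetical, and the real workflow is written in Nextflow rather than Python.

```python
# Hypothetical sketch: carry an explicit per-sample chunk index with each task
# instead of deriving it from a process-wide task counter.
def per_sample_chunks(chunk_counts):
    """Yield (sample, chunk) pairs, restarting the chunk index at 0 for each sample."""
    for sample, n_chunks in chunk_counts.items():
        for chunk in range(n_chunks):
            yield sample, chunk


# Assumed chunk counts, taken from the 3-sample example in the report.
chunk_counts = {"barcode01": 6, "barcode02": 6, "barcode03": 6}
pairs = list(per_sample_chunks(chunk_counts))

for sample, chunk in pairs[:8]:
    print(sample, chunk)
```

Because each pair carries its own index, a requested chunk can never exceed the number of chunks in that sample's BAM, regardless of how many samples are in the sample sheet.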

Blosers commented 2 months ago

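Hello, may I ask how you run this software? I have only just started using Linux, so there is a lot I don't know how to handle. Could I ask you for advice on how to set it up? I have already downloaded the .sif image and configured offline running, but the workflow still tries to pull the image from the internet when it runs. Sorry to bother you, but could you tell me how I should set my parameters? Thank you very much for taking the time to answer my question.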

sarahjeeeze commented 2 months ago

Hi, this is fixed in the most recent version.
