PacificBiosciences / pb-human-wgs-workflow-snakemake

DEPRECATED - Workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads
BSD 3-Clause Clear License
38 stars 19 forks source link

Alignment directory creation issue: Possibly related to needing index files for bams #132

Closed lauragails closed 1 year ago

lauragails commented 1 year ago

Hello, Happy to report that I'm chugging along! I am running process_smrtcells via process_smrtcells.local.sh

It has been running overnight, and it's possible that all is well! But I am concerned that things will go awry because aligned directories haven't yet been created ie: samples/{sample}/aligned/.

edit I only see the following directories in any {sample}/aligned: benchmarks jellyfish logs smrtcell_stats

And I do see smrtcell_stats output (read length and quality summaries), which I didn't expect yet. The reason why I'm curious is in config.yaml it says that "stats" and "coverage_qc" require alignment. The output for this, and for jellyfish, seems to be working fine, so I know the bams are being read in OK.

A possibly related question is: Is it necessary to feed .bam and .bam.bai into the pipeline? I don't have the index files, but I can index them + resubmit. I now see how the aligned bam.bai might not be added to the targets loop in python.

Here is an example of a logfile that is showing a warning about the missing index.

cat samples/PBG_3432_0000009206/logs/smrtcell_stats/m00000_000000_000000.log
[E::idx_find_and_load] Could not retrieve index file for 'smrtcells/ready/PBG_3432_0000009206/m00000_000000_000000.hifi_reads.bam'

Thank you!

williamrowell commented 1 year ago

Do you see any occurrence of Error in the top level snakemake log? I think it's likely that the alignment jobs haven't started yet.

But I do see smrtcell_stats output (read length and quality summaries), which I didn't expect yet. The reason why I'm curious is in config.yaml it says that "stats" and "coverage_qc" require alignment. The output for this, and for jellyfish, seems to be working fine, so I know the bams are being read in OK.

It sounds like it just ran the smrtcell_stats and jellyfish jobs before alignment. Neither of these require alignment. If the workflow is still running, it's likely that alignments haven't been created yet.

I'll fix the comment for smrtcell_targets->stats (#133).

A possibly related question is: Is it necessary to feed .bam and .bam.bai into the pipeline? I don't have the index files, but I can index them + resubmit. I now see how the aligned bam.bai might not be added to the targets loop in python.

Unaligned BAMs cannot be indexed as bai, and we don't require indices for any downstream steps. This error message from pysam/htslib should just be a warning instead, as it doesn't prevent the script from completing.

lauragails commented 1 year ago

Perfect. No I don't see any evidence of an "Error" - willing to bet the pipeline just didn't get to alignments.

Makes sense RE unaligned bams. Thank you as always for the unbelievably fast responses!

williamrowell commented 1 year ago

Side note for posterity's sake: PacBio does generate an index for unaligned BAMs in the .pbi format, but it isn't necessary for this workflow. Since we plan on acting on all reads and we don't care about the order in which we access them, indices are unnecessary.

The .pbi index can be useful for some other PacBio tools, or for trying to access reads quickly from arbitrary ZMWs.