cerebis / qc3C

Reference-free quality assessment for Hi-C sequencing data
GNU Affero General Public License v3.0
12 stars 1 forks source link

Parsing of BAM header when file is expected to be readname-sorted. #58

Closed agilly closed 2 years ago

agilly commented 2 years ago

Hello,

I'm trying to run qc3C bam and as the doc specifies, I sorted my input bam by name:

samtools sort -t 32 -n -o $libname.aligned.sorted.bam $libname.aligned.bam

Then I run:

qc3C bam -k arima --fasta hg38.p13.fa.gz --bam $libname.aligned.sorted.bam --output-path $outpath -t 10

That produces an error:

[...]
INFO     | 2022-01-24 04:33:24,898 | qc3C.bam_based | Found 1,237,255,262 alignments to analyse
[E::idx_find_and_load] Could not retrieve index file for 'masked_filename.aligned.sorted.bam'
ERROR    | 2022-01-24 04:33:24,904 |    main | masked_filename.aligned.sorted.bam must be sorted by name

This is strange because it seems something in the code is looking for a bam index, which is not possible to generate for a read-name-sorted BAM.

JonEilers commented 2 years ago

I am running into the same error using the conda version of qc3C

qc3C bam --enzyme DpnII --fasta /home/jon/Working_Files/s_chloronotus/hapo-g/round_3/hapog_results/hapog.fasta --bam /home/jon/Working_Files/s_chloronotus/bwa-mem/aligned.SRR8499559.bam --max-obs 100000 --output-path /home/jon/Working_Files/s_chloronotus/qc3c/SRR8499559_w_bam/
INFO     | 2022-01-28 11:20:27,247 | qc3C.ligation | Loading cached cut-site database
INFO     | 2022-01-28 11:20:27,422 | qc3C.bam_based | Accepting all usable reads
INFO     | 2022-01-28 11:20:27,423 | qc3C.utils | Random seed was not set, using 1383103
INFO     | 2022-01-28 11:20:27,423 | qc3C.bam_based | Beginning analysis...
Pairs:   0%|                                                                                 | 0/100000 [00:00<?, ?it/s][E::idx_find_and_load] Could not retrieve index file for '/home/jon/Working_Files/s_chloronotus/bwa-mem/aligned.SRR8499559.bam'
ERROR    | 2022-01-28 11:20:27,429 |    main | 'HD'
Pairs:   0%|         
cerebis commented 2 years ago

Hi @agilly,

There appears to be something strange happening here. One expected and one unexpected.

First issue

The warning emitted by HTSlib about a missing index is known and spurious when it comes to query-name sorted files. There is yet to be a fix in PySAM to address how the underlying HTSlib API is called. See https://github.com/pysam-developers/pysam/issues/939 for more detail.

I may suppress this message, as I had been hoping PySAM would be modified by now to avoid raising it.

Second issue

qc3C makes an explicit test for the sorting type. This is done by inspecting the 'HD' record within BAM header, and it would appear that your HD record does not specify query-name sorting.

Could you please provide me with that string?

ie.

samtools view -H {your bamfile} | grep '^@HD'

An example of the header string is as follows.

@HD VN:1.6  SO:queryname

I suspect your header does not contain an 'SO' tag-field or the value associated deviates from 'queryname'.

In either case, it will help to see what you have and -- importantly -- if you could also describe to me the workflow which creates the BAM that goes into qc3C.

cerebis commented 2 years ago

partially addressed by commit f1cc153

agilly commented 2 years ago

Hi Matthew,

thanks for your reply. Indeed I can confirm that my bam file does not contain the expected header:

$> samtools view -H L79851_Track-123454.aligned.sorted.bam| grep '^@HD'
@HD     VN:1.6  SO:unknown

The bam files were generated as follows:

samtools sort -n [aligned.bam] -o [aligned.sorted.bam]

However, the logs for that particular step are missing so I will regenerate just to be sure.

agilly commented 2 years ago

UPDATE: Running the above command gives me :

@HD     VN:1.6  SO:queryname

Going back to my command, I used -t 32 to determine threads, but that was actually a mistake. From the help:

  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)

I confirmed that the first million read names in both files is the same, so perhaps -t 32 did not actually change the -n sort, but it must have caused the SO to turn to unknown.

cerebis commented 2 years ago

No worries. I always find it curious that samtools went with -@ n for thread count. This is the only suite of programs I am aware of that has elected to use that character. Perhaps it was a brilliant choice, but being the odd-man-out can lead to issues just like you've experienced.

cerebis commented 2 years ago

I am going to reopen this issue, with the intention of reporting the sort order if found as part of the error.