Question about minimap2 parameters in long_read_typing.py

karoliinas commented 3 months ago

Hi! I'm running long_read_typing.py for PacBio -reads for which I have extracted the HLA-regions (originally aligned data to chm13, then extracted hla with HLA*LA and ran bam2fastq). I was wondering about the alignment, as the same script is used also for typing Nanopore reads and there is only one set of minimap2 parameters: minimap2 -t {parameter.threads} -p 0.1 -N 100000 -a $ref $fq

And wondered why those recommended for pacbio aren't used? E.g: minimap2 -ax map-pb ref.fa pacbio-reads.fq or -k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200

Should instead use the script SpecHLA.sh or am I correct in only running the long_read_typing.py at this point?

Many thanks!

wshuai294 commented 3 months ago

Hi,

Thank you for the patient effort. For long-read data only, long_read_typing.py should be used instead of SpecHLA.sh. For long-read-only typing, we put effort into binning reads instead of the alignment. I believe your suggested parameter setting of Minimap2 is better. We will update it as suggested soon. Thank you so much.

karoliinas commented 3 months ago

Thanks for your insights and the great the tool! I will modify the script for my hifi data. It's low coverage (~10X) and I appreciate it's not optimal for hla typing. I'm now running both HLA*LA and SpecHLA to have more confidence in the genotypes that match.

karoliinas commented 2 months ago

Hi, I'm reopening the issue although I'm not sure whether my question falls under the same category, but likely it has something to do with alignment. I managed to genotype my data, and the results mostly agree with those from HLA*LA, which is nice. I'm plotting the depths to further evaluate the calls, and notice that HLA DPB1 has a ~500bp region in all the samples with spuriously high depth (1000-2000), whereas the data is generally ~10x. Do you have any idea why this might be?

Thanks again!

wshuai294 commented 2 months ago

Hi, I guess two possible reasons:

Is the input data short reads or long reads? If they are short reads, it might be caused by PCR duplicates which can be removed by tools such as fastp.
This region might be homologous for many HLA genes. Maybe you can extract this ~500bp region, and map it to chr6 to see the hits. Or you can use online ncbi-blast to map it to the nr database. Also, you can observe the reads mapped to this region, and check if they only map to this region or they can map to other regions.

Hope it can help.

Best, Shuai

karoliinas commented 2 months ago

Hi, thanks for the insights! I have been observing the alignment files outputted by specHLA (DPB1.bam), and the read quality and mapping seem fine to me.

I have long read data (pacbio hifi), and therefore only ran the long_read_typing.py. Does it also perform read binning, or only alignment? I was thinking could the problem arise in the binning step?

Best, Karoliina

Thanks for the help!

wshuai294 commented 2 months ago

Hi,

SpecHLA indeed bins the long reads as well. Please look at the file ${sample}.db.sam in the output folder, which records what alleles each read can map to.

Best, shuai

karoliinas commented 2 months ago

Right, first of all, it's great to have all the outputs for troubleshooting! Seems that there's a lot going on under the hood. So for most of the samples, the file ${sample}.db.sam has roughly 3M reads altogether, and ~900K reads are assigned for DPB1. Does that sound like a lot?

For your information, I did change the mapping parameters to suit the hifi -reads because the original got stuck in the mapping step for a long time:

under class Pacbio_Binning: minimap2 -x map-hifi -t {parameter.threads} -p 0.1 -N 100000 -a $ref $fq > $outdir/$sample.db.sam and under class Fasta: minimap2 -x map-hifi -t %s -a $hla_ref $outdir/$hla.%s.fq.gz | samtools view -bS -F 0x800 -| samtools sort - >$outdir/$hla.bam I didn't, however, touch the parameters about secondary mapping.

Thank you for your helpful comments and your patience!

wshuai294 commented 2 months ago

Hi, I found that the reads from the bacterial artificial chromosome (BAC) are wrongly assigned to DPB1. It would be a good idea to map the long reads to the BAC library to remove such reads before using SpecHLA.

Here is how I find this: First, I check the reads assigned to DPB1 in the NA19238 PacBio hifi sample, where the read m64039_190830_040857/79364868/ccs is weird, cause only a small middle fraction can map to DPB1 allele (cigar shows). If a read is derived from DPB1, there will be three cases: the left end of the read map to the DPB1, the middle part can map to the whole DPB1 gene, or the right end of the read map to the DPB1. It is impossible that only a small middle fraction maps to DPB1.

To see what the read is, I map the reads to the NR database using ncbi-blast, and found this read can perfectly map to Homo sapiens BAC clone RP11-576F1 from 2, complete sequence. The alignment result is

Maybe you can extract some reads assigned to DPB1 and map them to the NR database using ncbi-blast too. If it's true, we can remove them by aligning the raw reads to the BAC library before using SpecHLA.

karoliinas commented 2 months ago

Wow, great, thank you for your trouble! Interesting, that there is an identical sequence in BAC and DPB1. I will test blast, and am currently downloading the databases as my data is sensitive and I can't use the web tool. Would this BAC presence be due to contamination? And where does it come from, I understand it is used as cloning vector, but I am dealing with whole genomes measured from blood. Sorry, seems that I have questions again, but they are not urgent. I'll let you know when I have blasted a few DPB1 -reads.

wshuai294 commented 2 months ago

I also guess the BAC sequences are contamination introduced in sequencing; they might be used for sequencing accuracy assessment or human DNA amplification. However, further surveys are needed to confirm this. Waiting for your alignment results.

karoliinas commented 2 months ago

Hi, after blasting a number of the dubious reads, they do hit some kind of BACs, though with slightly different id than in your screenshot. I tried running HiFiAdapterfilt but seems that the sequences aren't included in the list of PacBio adapters, so that was not helpful. Not sure which BAC library to align the reads to, and how to go about it. How big of an effect would you think these reads have in the hla-haplotyping? And if it's big, would it affect all the genes or only DPB1?

wshuai294 commented 2 months ago

Thanks for the reply. We have not assessed the influence of BAC yet. I think these BAC-derived reads only harm high-resolution typing results (e.g., 8-digit). I will construct a pipeline to remove such reads, and see whether it improves the performance of SpecHLA. If it does, I will add it to long_read_typing.py.

wshuai294 commented 2 months ago

By the way, what's the location of this HLA DPB1 region, i.e., ~500bp region in all the samples with spuriously high depth (1000-2000)? In my data, this region is DPB1:2248-2832bp based on the SpecHLA reference.

karoliinas commented 2 months ago

Ok, this is a relief, that the results should hold :) And like I said, the results are pretty identical to those from hla*la.

The region in my data is 2265-2832, slightly different start. Many thanks for your help and insights!

wshuai294 commented 2 months ago

It is great to hear the consistency between the results of HLA*LA and SpecHLA, indicating they are reliable. I am collecting the BAC sequences from NCBI.

deepomicslab / SpecHLA

Question about minimap2 parameters in long_read_typing.py #28