bcgsc / HLAminer

⛏ HLA predictions from NGS shotgun data
Other
51 stars 15 forks source link

About HLAminer analysis on WES data #24

Closed HugoMananet closed 2 years ago

HugoMananet commented 2 years ago

Hello,

I routinely use HLAminer on WES data, recently the design of this analysis was changed.

The regions covering the class I HLA genes have been reduced, but are still covering the RefSeq coordinates for these genes.

I have also detected empirically that the overall depth of coverage of these regions has also been significantly reduced for the samples sequenced with the new design.

For the samples generated with the old design, HLAminer detected HLAtypes for the Class I genes without any problems (using the HPTASR method via the HPTASRwgs_classI.sh script) and no changes needed to any threshold parameters.

For the samples generated with the new design, HLAminer does not detect any HLA types for these genes.

In your opinion, which one of these changes could influence the HLAminer analysis the most, the lowering of the depth of coverage or the smaller size of the regions targeted by the design, or could the problem lie elsewhere ?

Thank you !

warrenlr commented 2 years ago

Thank you for your message, I'm glad to hear that HLAminer is useful to you.

Regarding your question, let me answer with a question: does the pipeline generate contigs >= 200bp (default)? if not, my hunch is that the lower coverage is preventing targeted read assembly with TASR, or the contigs are too short (<200) and don't make it to the downstream analysis. If there are contigs >200bp, perhaps the min. score threshold can be adjusted (see hlaminer.pl). Both parameters can be changed on the shell scripts. Aside from that, did the WES probe array design change (i.e. are these regions captured at all -- in other words, are there HLA reads in your set) ?

Cheers, Rene

HugoMananet commented 2 years ago

Hi,

Thank you for your fast answer,

With the 'failing design' I always get only one contigs of length >200bp typically looking like this :

contig1|size277|read594|cov172.32 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

which can probably explain that the downstream analysis doesn't provide any results...

I have asked for details on the probe array design change to my collegues in charge of this, but i suspect it might have.

If this is the case, would the increase in quantity and/or diversity of probes get better coverage and thus result in more and better quality contigs of length >200 resulting of the targeted read assembly with TASR ?

Thanks again, Hugo

warrenlr commented 2 years ago

would the increase in quantity and/or diversity of probes get better coverage and thus result in more and better quality contigs of length >200 resulting of the targeted read assembly with TASR ?

in theory, yes.

My suggestion would be to : 1) QC your sequences (e.g. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) 2) Align to HLA seq database, the same number of read pairs from a previous design and the new design and run basic stats on the results (e.g. Samtools Flagstat)

Are there other contigs (i.e. <200 bp) ? If so, and the TASR assemblies are fragmented, you may modify the shell script to analyze shorter sequences

Restrict 200nt+ contigs

cat TASRhla.contigs |perl -ne 'if(/size(\d+)/){if($1>=200){$flag=1;print;}else{$flag=0;}}else{print if($flag);}' >TASRhla200.contigs

to: (say analyzing contigs>=150bp)

cat TASRhla.contigs |perl -ne 'if(/size(\d+)/){if($1>=150){$flag=1;print;}else{$flag=0;}}else{print if($flag);}' >TASRhla200.contigs

You don't have to rename "TASRhla200.contigs" on each line but can if you prefer.

also add: -s 500 -z 150. to the HLAminer.pl line

../bin/HLAminer.pl -b tig_vs_hla-ncbi.coord -r hla_vs_tig-ncbi.coord -c TASRhla200.contigs -h ../database/HLA-I_II_GEN.fasta -s 500 -z 150

the plot thickens..

HugoMananet commented 2 years ago

Hi again,

Per your suggestions :

  1. QC your sequences (e.g. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

We do QC every sample analyzed so i tried to pinpoint the differences existing between the QC of samples generated with the old vs those with new design.

Fastqc doesn't give lots of clues except maybe that : 1- The reads generated by the new design are 10bp shorter (due in part to the trimming step necessary) 2- The R2 fastq file still contains polyG sequences (even if its <1%)

To address the second point, I eliminated the polyG sequences and ran hlaminer in case they were the ones bothering the assembly of others more meaninful sequences.

  1. Align to HLA seq database, the same number of read pairs from a previous design and the new design and run basic stats on the results (e.g. Samtools Flagstat)

QC for the alignments steps especially show the differences between the 2 designs (Mapped reads per contig).

Here are the results of the QC analysis done with 22BM-363, 22BM-383-EX2, 22BM-435-EX2 and 22BM-484-EX2 being the samples of the old design(with hlaminer results). QC_html_reports.zip

Are there other contigs (i.e. <200 bp) ? If so, and the TASR assemblies are fragmented, you may modify the shell script to analyze shorter sequences

I also applied your suggestion to use contigs shorter than 200bp for the analysis but with no change in results.

There is indeed other, shorter contigs, but their coverage is <=10 and so the HLAminer analysis with shorter contigs doesn't give any results in the end.

I had tried before just to change the -s parameter but i had to go very low for the score threshold (<100) to get any results at all (could the results be trusted ?).

I am yet to hear from my colleague concerning the WES probe array design change, even though now it is almost certain there was a change considering what we can observe in the QC files and the contigs coverages.

Thanks again !

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your interest in HLAminer!