Illumina / ExpansionHunter

A tool for estimating repeat sizes
Other
182 stars 51 forks source link

"Encountered empty query" #94

Open jwagnerhki opened 4 years ago

jwagnerhki commented 4 years ago

With master 274903d but also tag v3.1.2 neither a Nanopore based BAM (NGLMR, also minimap2) nor Illumina 150bp based BAM (bwa mem, also minimap2) seem to work at all in ExpansionHunter.

With the Nanopore based BAM there is no error shown, but the resulting VCF is blank apart of the header. There are just warnings(?) like:

2020-01-12T21:22:54,[Skipping 1cfb1d7c-1277-4ed6-875e-65dfa75272d8/2 because it is unpaired] 2020-01-12T21:22:54,[Skipping 219c7422-e73c-4d63-9097-a01bd7facce9/2 because it is unpaired] 2020-01-12T21:22:54,[Skipping 9d91b9b5-f2d1-4b5d-9aa1-6f91b4ab1a01/2 because it is unpaired] 2020-01-12T21:22:54,[Skipping locus TCF4 due to low coverage]

With the Illumina based BAM the search stops immediately with "Encountered empty query".

$ ExpansionHunter --reads kitID_hg38_bwamem_v2.bam --reference /sf_Genome/hg38/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna --output-prefix kitID_hg38 --variant-catalog variants.json --log-level=trace

2020-01-12T21:08:56,[Starting Expansion Hunter v3.2.0] 2020-01-12T21:08:56,[Workflow parameter object is initialized with HeuristicParameters(regionExtensionLength=1000, qualityCutoffForGoodBaseCall=20, skipUnaligned=true, alignerType=dag-aligner, kmerLenForAlignment=14, paddingLength=10, seedAffixTrimLength=14)] 2020-01-12T21:08:56,[Analyzing sample 60820188482512_hg38_bwamem_v2] 2020-01-12T21:08:56,[Initializing reference /sf_Genome/hg38/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna] 2020-01-12T21:08:56,[Loading variant catalog from disk variants.json] 2020-01-12T21:08:56,[Running sample analysis in seeking mode] 2020-01-12T21:08:57,[Encountered empty query for A00910:49:HW7Y5DSXX:3:2537:3278:34632/1]

egor-dolzhenko commented 4 years ago

Thank you for reporting the issues.

ExpansionHunter is designed for short-read data and so it will not work on Nanopore reads.

And it looks like the Illumina 2x150bp BAM file may contain some reads with an empty sequence. May I ask how exactly this file was generated? Were the reads hard-clipped by any chance? If yes, it would be better to run ExpansionHunter on non-clipped data since the program requires that all reads have (nearly) the same length.

If it is permissible to share a small subset of the BAM file, please feel free to send it to me at edolzhenko@illumina.com. I would be happy to help diagnose the problem.

Best wishes, Egor

jwagnerhki commented 4 years ago

The BAM file was generated with:

bwa mem -a -M -Y -V -L 100,100 -t 4 -R '@RG\tID:1\tLB:LB0\tPL:PL0\tPU:PU0\tSM:${KIT}' \ ../GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.bgz \ ${KIT}_SA_L001_R1_001.fastq_v2.gz \ ${KIT}_SA_L001_R2_001.fastq_v2.gz | samtools view -u -F 4 -@ 8 -t ../GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.bgz.fai - \ samtools sort -@ 8 -o ${KIT}_hg38_bwamem_v2.bam -

I think no hard clipping was used. Actually tried to avoid soft clipping too (-L 100,100) as a too high percentage of the fastq reads were mapping, but, the -L option did not really help.