Closed theal77 closed 4 years ago
See a related open issue here with more details (https://github.com/WGSExtract/WGSExtract.github.io/issues/1#issuecomment-674252940).
Try the mingw version of samtools in that sane installation. (We are assuming you are using the WGS Extract Betav2 port to Win10 of the htslib tools.)
If it still fails, we will direct you to our Beta program install that has a more recent port of the newest samtools 1.10 tool release that htslib now verifies on Win10 as part of their development.
If that does not work, then we can only suggest to try a MacOSX or Ubuntu Linux OS machine. htslib / samtools are developed on those machines.
Of course, please check the other possible errors like a bad download of the CRAM from the Nebula AWS or EBI server itself. But your error is likely from the code itself; not necessarily a badly formatted file.
Thanks Randy,
"mingw" version worked! :)
Quick question:
Nebula uses hg38 assembly, but WGS Extract CRAM to BAM uses GRCh38.
WGS Extract interface saying that the generated BAM header does not indicate a reference genome.
What reference genome should I specify for the generated BAM file?
Thanks!
Independent of what WGS Extract reports, the reference model used for the Nebula BAM/CRAM is reference_genomes/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz. WGS Extract identifies the Nebula BAM reference model as HG38. When a BAM is identified as HG38, WGS Extract (currently) uses the reference model file hg38.fna.gz. It only uses the above Nebula referenced file if it identifies a BAM as GRCh38.
Both the hg38.fna.gz and GCA*fna.gz reference model files use the SN convention of "chr1" (for chromosome 1) and hence why the issue crops up. WGS Extract presumes all GRChnn models name autosomes strictly numerically and not alphabetically. And hence why it gets it wrong. Based on the current code, a GRCh38 model will never be identified.
A way you could fool the tool is to rename the current hg38.fna.gz file to something else (add .bak?) and copy the GCA*fna.gz file to also be named hg38.fna.gz. Do this only while processing your Nebula BAM files that you know are GRCh38. The place where the type of the xx38 reference model makes a big difference is when doing microarray file generation. But, to be honest, you are best converting your BAM to an HG19 reference model before generating Microarray files. (The current WGS Extract does not have this remap feature built in but you can follow the Bioinformatics for Newbies doc and run the commands yourself.)
We are working to fix WGS Extract code to identify the actual model used and not try to classify the BAM too broadly or simply the way it does now For most uses, the current selection is not a problem as these two models are really similar. This has not been modified in the current under-development pre-BETA release either as I have been working at rewriting the whole reference model system since June. Let me see if there is a quick stop-gap patch I can do that will fix the current code to more properly handle Nebula BAM files in the mean time. The best way to distinguish these two reference models is HG38has the analysis set and so has 595 SN entries. The GRCh38 model in WGS Extract is no analysis and so is only 195 entries. The Nebula BAM header only has 195 SN entries.
You can see a more detailed view of Reference Genome model file analysis developed in the Document Determining Your BAMs Reference Model at https://bit.ly/34CO0vj. WIth an accompanying spreadsheet at https://bit.ly/2ZmYPAg.
Thank you Randy, this is very helpful!
Fyi, Nebula under FAQ states they are using hg38 version, so it added more to my confusion ;) https://nebula.org/faqs/ How do you process the sequencing data? We align your sequencing data to the hg38 version of the human genome using a GATK-based pipeline.
Honestly, I think the model mentioned is likely a version that could be classified as HG38. Hence my larger reference model study :) It could be that current WGSE tool has an incorrect generic GRCh38 model as it is :) I will mark this closed for now as you have a work-around and it is in progress to be fixed with current tool development.
Hi, I tried to run Nebula CRAM to BAM conversion under Windows 10 using the following command (twice): _samtools view -b -@ 4 -T GCA_000001405.15_GRCh38_no_alt_analysisset.fna.gz -o new.bam my.cram
Both times command failed with error: "samtools cygwin_exception::openstackdumpfile"
The error occurred probably towards the end as BAM file got to 55GB both times (CRAM is 44.6GB) and two generated Stackdump files are the same:
"Exception: STATUS_ACCESSVIOLATION at eip=61161878
eax=36ED38D3 ebx=0678B0F1 ecx=019E2C3C edx=00000000 esi=491D55DF edi=E6230008
ebp=FFB7CB58 esp=FFB7CB4C program=E:\DNA\WGSExtract\programs\samtools-cygwin\samtools.exe, pid , thread samtools
cs=0023 ds=002B es=002B fs=0053 gs=002B ss=002B
Stack trace:
Frame Function Args
End of stack trace"