Closed jkotl0327 closed 4 years ago
Could you share the BAM here, please?
Github says file type unsupported, this is the link to the selection of BAMs I got it from, it is I4517.hg19.bam: https://www.ebi.ac.uk/ena/browser/view/PRJEB37057
Last login: Thu Jul 9 11:42:38 on ttys000
/usr/local/bin/python3 /Users/jeremyk/Desktop/WGSExtract/programs/wgsextract/wgsextract.py; exit
jeremyk@Jeremys-MBP ~ % /usr/local/bin/python3 /Users/jeremyk/Desktop/WGSExtract/programs/wgsextract/wgsextract.py; exit
This little window shows what WGSExtract is doing in the background
while it runs tasks that the user initiated.
This can be a help if you experience any errors.
The window can be minimized if there are no errors.
dos2unix: converting file /Users/jeremyk/Desktop/WGSExtract/temp/get_samtools_stats.sh to Unix format...
dos2unix: converting file /Users/jeremyk/Desktop/WGSExtract/temp/extract23.sh to Unix format...
Starting mpileup... Please be patient!
[warning] samtools mpileup option v
is functional, but deprecated. Please switch to using bcftools mpileup in future.
[E::fai_load3_core] Failed to open FASTA file /Users/jeremyk/Desktop/WGSExtract/reference_genomes//human_g1k_v37.fasta.gz
[tabix] the compression of '/Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_raw.vcf.gz' is not BGZF
Mpileup completed. Starting SNP calling...
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
Failed to read from /Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_raw.vcf.gz: unknown file type
[tabix] the compression of '/Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_called.vcf.gz' is not BGZF
SNP calling completed. Starting annotation...
Failed to read from /Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_called.vcf.gz: unknown file type
[tabix] the compression of '/Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_annotated.vcf.gz' is not BGZF
Annotation completed. Starting extraction from VCF ...
Failed to read from /Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_annotated.vcf.gz: unknown file type
Extraction from VCF completed. Sorting by chromosome and position ...
dos2unix: converting file /Users/jeremyk/Desktop/I4517.hg19__CombinedKit_All_SNPs_RECOMMENDED.txt to Unix format...
adding: I4517.hg19__CombinedKit_All_SNPs_RECOMMENDED.txt (deflated 46%)
Generating file in format 23andMe_V5
Traceback (most recent call last):
File "/Users/jeremyk/Desktop/WGSExtract/programs/wgsextract/aconv.py", line 247, in
zip error: Nothing to do! (/Users/jeremyk/Desktop/I4517.hg19_23andMe_V5.zip)
Make sure you actually have human_g1k_v37.fasta.gz in /WGSExtract/reference_genomes/ looks like samtools cannot find it.
I have human_g1k_v37.fasta.gz.fai & human_g1k_v37.fasta.gz.gzi & human_g1k_v37.dict, not just human_g1k_v37.fasta.gz though. Do I need it?
Human_g1k_v37.fasta.gz is essential as it is a genome reference.
How can I get it? Everything was working before.
Here's one location.
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
The error message was the BAM is not in bgzip format. It was likely gzipped instead. There is a tool called htsfile (part of htslib, the overall package that originates samtools and bcftools) that will give you this diagnosis on BAMs, FASTQs, FA's, etc. I am out at the moment but can give you the specific info when i get home.
Randy (from phone, DYACs involved)
On July 9, 2020 1:01:41 PM jkotl0327 notifications@github.com wrote:
I have human_g1k_v37.fasta.gz.fai & human_g1k_v37.fasta.gz.gzi & human_g1k_v37.dict, not just human_g1k_v37.fasta.gz though. Do I need it? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.
Thank you
So I am back at a desktop. I misread the message script from
Teepean on my phone. It says the FASTA file supplied by the
program is not in bgzip format, not the BAM.
The current scripts do not account for an error early in the
pipeline and so later commands simply fail because the
intermediate files have not been properly created. It appears the
script never got to creating a CombinedKit file which means all
the final files after that will be empty as the scripts simply
"cat" the headers and then try to append the extracted values from
the CombineKit file into that. If the CombinedKit is empty, there
is nothing to extract. This is not normal but because you are
using odd-sourced files, the WGS Extract tool is not currently
robust to pick up on these variances.
Give me a few more minutes to go back into an original, original
Feb2020 release to make sure all the files have been properly
built in the reference models. And run it on your example.
I did download your reference BAM mentioned and it did seem to
be in BGZF (bgzip) format. The program htsfile is not in the
Feb2020 release (but will be in the next release). I jumped to a
conclusion on my quick read of Teepean's run as I have seen a
number of the reference models and BAM files incorrectly created
with gzip instead of bgzip.
htsfile *.{bam, bam.gz, fastq.gz, fai.gz}
GRCh38_full_analysis_set_plus_decoy_hla.fa.gz: FASTA
BGZF-compressed data
HG01356.final.cram: CRAM version 3.0 compressed sequence
data
HG01356_R1.fastq.gz: FASTQ gzip-compressed sequence data
HG01356_R2.fastq.gz: FASTQ gzip-compressed sequence data
samtools index or samtools faidx will not work if the file is
gzip compressed. Must be bgzip.(BGZF). Do not recall, but it
appears BGZF format files will also be reported simply as
"compressed sequence data" instead of "BGZF-compressed data". I
need to look at the source code again to figure out what the
difference there is. For sure if it says gzip-compressed, then
you need to do the following (as an example):
gunzip your.bam.gz
bgzip your.bam
Note that gunzip requires the file name end in .gz extension.
Which it will strip after completion. bgzip will then add the .gz
extension (although it really should add .bgz). Just a funny
requirement for these tools. Both replace the existing file and
rename it. This with some of the extension naming dependencies by
some of the tools ends up requiring one to rename files as part of
a longer script. Not pretty.
I have been running into this a lot with getting reference models
and similar files. People are not careful when posting. Or
sometimes repost an update carelessly without checking the correct
compression was used. Hopefully that is not the case for the files
delivered in the WGS Extract tool release of 25 Feb 2020 :) Our
new release is not going to include reference models but instead
automatically download them as needed. As such, we have been
running into this issue and working to auto-detect and correct
when it happens. Will likely then extend the logic back into the
BAM CRAM files as appropriate; and FASTQ and VCF when we start
accepting them.
If you (Jeremy) or Teemu (Teepean) figure it out first, report
back here.
Randy
On 7/9/2020 2:50 PM, Randy (H600)
wrote:
The error message was the BAM is not in bgzip
format. It was likely gzipped instead. There is a tool called
htsfile (part of htslib, the overall package that originates
samtools and bcftools) that will give you this diagnosis on
BAMs, FASTQs, FA's, etc. I am out at the moment but can give
you the specific info when i get home.
Randy
(from phone, DYACs involved)
On July 9, 2020 1:01:41 PM
jkotl0327 <notifications@github.com> wrote:
I have human_g1k_v37.fasta.gz.fai &
human_g1k_v37.fasta.gz.gzi & human_g1k_v37.dict,
not just human_g1k_v37.fasta.gz though. Do I need it?
—
You are receiving this because you are subscribed to
this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
Hi Randy, Sorry, but as you know I'm not that good with coding. You might need to dumb it down a little for me as I'm having some trouble following what we need to figure out and report back.
So I tried and failed to recreate Jeremy's issue on a Windows
machine (not MacOS Catalina like he has). I also successfully ran
and failed to recreate the problem on the current development
release. (I do not have my Mac VM's back yet since burning out my
computer and getting a new one around the time I was last
debugging Jeremy's MacOS install issues).
On my machine, the commands before and at the error report in
the current public release are:
"C:/WGSE/Betav2b/programs/samtools-mingw/samtools" mpileup -C 50
-v -l
"C:/WGSE/Betav2b/programs/extract23/All_SNPs_combined_RECOMMENDED_GRCh37_ref.tab.gz" -f ""C:/WGSE/Betav2b/reference_genomes/\human_g1k_v37.fasta.gz"" "C:/Users/Randy/Downloads/I4517.hg19.bam" > "C:/WGSE/Betav2b/temp/temp_autosomes_raw.vcf.gz" "C:/WGSE/Betav2b/programs/samtools-mingw/tabix" -p vcf "C:/WGSE/Betav2b/temp/temp_autosomes_raw.vcf.gz The tabix error of not finding the file in BGZF format is simply because there is an empty or non-existent file temp/temp_autosomes_raw.vcf.gz . So the command before the tabix command is failing to produce the file. Which is the samtools mpileup which reads the reference file human_g1k_v37.fasta.gz. That file is provided by the WGS Extract release in the reference_genomes directory. But Jeremy's command failed for that file read. So Jeremy, can you do an "ls -l" of the reference_genomes folder so we can see its content. That file is static with the release and should not be changing. If you have run WGS Extract Autosomal extract before, that folder was being used. Maybe not this exact reference model, but one in that folder.
Also Jeremy, can you let us know what version of samtools you have
with the MacOS Catalina install -- execute "samtools" on a line by
itself to report its version.
Teemu had it correct in that the reference genome was not found.
But the issue is we deliver that file and its index files with the
tool and so it should always be there. Maybe a MacOS install issue
again of trying to find the file via the path?
Randy
Note the above script and Jeremy's exhibit some of the minor
errors already fixed relating to quotes and slashes.
As an example, here is my output of the requests for
information for the same version but on Windows which has
Marko/Teemu's CygWin/MinGW tool release:
$ pwd
/cygdrive/c/wgse/betav2b/reference_genomes
$ ls -l
total 4461796
-rwxrwx---+ 1 Randy None 33956 Feb 12 21:25
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.dict*
-rwxrwx---+ 1 Randy None 886344618 Dec 23 2019
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz*
-rwxrwx---+ 1 Randy None 7804 Dec 23 2019
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.fai*
-rwxrwx---+ 1 Randy None 770648 Dec 23 2019
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.gzi*
-rwxrwx---+ 1 Randy None 3162 Feb 12 21:23 hg19.dict*
-rwxrwx---+ 1 Randy None 881223138 Dec 23 2019 hg19.fa.gz*
-rwxrwx---+ 1 Randy None 783 Dec 23 2019 hg19.fa.gz.fai*
-rwxrwx---+ 1 Randy None 771400 Dec 23 2019 hg19.fa.gz.gzi*
-rwxrwx---+ 1 Randy None 81573 Feb 12 21:22 hg38.dict*
-rwxrwx---+ 1 Randy None 1021804486 Dec 23 2019 hg38.fa.gz*
-rwxrwx---+ 1 Randy None 25606 Dec 23 2019 hg38.fa.gz.fai*
-rwxrwx---+ 1 Randy None 814344 Dec 23 2019 hg38.fa.gz.gzi*
-rwxrwx---+ 1 Randy None 11092 Feb 12 21:21 hs37d5.dict*
-rwxrwx---+ 1 Randy None 892384594 Dec 23 2019 hs37d5.fa.gz*
-rwxrwx---+ 1 Randy None 2813 Dec 23 2019 hs37d5.fa.gz.fai*
-rwxrwx---+ 1 Randy None 781800 Dec 23 2019 hs37d5.fa.gz.gzi*
-rwxrwx---+ 1 Randy None 11673 Feb 12 21:19 human_g1k_v37.dict*
-rwxrwx---+ 1 Randy None 882998580 Dec 23 2019
human_g1k_v37.fasta.gz*
-rwxrwx---+ 1 Randy None 2746 Dec 23 2019
human_g1k_v37.fasta.gz.fai*
-rwxrwx---+ 1 Randy None 772920 Dec 23 2019
human_g1k_v37.fasta.gz.gzi*
$ ../programs/samtools-mingw/samtools.exe
Program: samtools (Tools for alignments in the SAM format)
Version: 1.6-139-gc7ef384 (using htslib 1.6-304-g71d6fa7)
For the development release, the version of samtools is:
$ pwd
/cygdrive/c/wgse/dev/programs/win10tools/bin
$ ./samtools.exe
Program: samtools (Tools for alignments in the SAM format)
Version: 1.10 (using htslib 1.10)
So MacOS Samtools cannot be greater than 1.10 and so it cannot be
an issue of the samtools mpileup command being formally
deprecated.
On 7/9/2020 11:44 AM, jkotl0327 wrote:
Last login: Thu Jul 9 11:42:38 on ttys000
/usr/local/bin/python3
/Users/jeremyk/Desktop/WGSExtract/programs/wgsextract/wgsextract.py;
exit
jeremyk@Jeremys-MBP ~ % /usr/local/bin/python3
/Users/jeremyk/Desktop/WGSExtract/programs/wgsextract/wgsextract.py;
exit
This little window shows what WGSExtract is doing in the
background
while it runs tasks that the user initiated.
This can be a help if you experience any errors.
The window can be minimized if there are no errors.
dos2unix: converting file
/Users/jeremyk/Desktop/WGSExtract/temp/get_samtools_stats.sh to
Unix format...
dos2unix: converting file
/Users/jeremyk/Desktop/WGSExtract/temp/extract23.sh to Unix
format...
Starting mpileup... Please be patient!
[warning] samtools mpileup option v is functional,
but deprecated. Please switch to using bcftools mpileup in
future.
[E::fai_load3_core] Failed to open FASTA file
/Users/jeremyk/Desktop/WGSExtract/reference_genomes//human_g1k_v37.fasta.gz [tabix] the compression of '/Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_raw.vcf.gz' is not BGZF Mpileup completed. Starting SNP calling... Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid Failed to read from /Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_raw.vcf.gz: unknown file type [tabix] the compression of '/Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_called.vcf.gz' is not BGZF SNP calling completed. Starting annotation... Failed to read from /Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_called.vcf.gz: unknown file type [tabix] the compression of '/Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_annotated.vcf.gz' is not BGZF Annotation completed. Starting extraction from VCF ... Failed to read from /Users/jeremyk/Desktop/WGSExtract/temp/temp_autosomes_annotated.vcf.gz: unknown file type Extraction from VCF completed. Sorting by chromosome and position ... dos2unix: converting file /Users/jeremyk/Desktop/I4517.hg19__CombinedKit_All_SNPs_RECOMMENDED.txt to Unix format... adding: I4517.hg19__CombinedKit_All_SNPs_RECOMMENDED.txt (deflated 46%) Generating file in format 23andMe_V5 Traceback (most recent call last): File "/Users/jeremyk/Desktop/WGSExtract/programs/wgsextract/aconv.py", line 247, in convert_adna() File "/Users/jeremyk/Desktop/WGSExtract/programs/wgsextract/aconv.py", line 185, in convert_adna while pos1_smaller_than_pos2(called_chrom, called_pos, templ_chrom, templ_pos) NameError: name 'called_chrom' is not defined dos2unix: /Users/jeremyk/Desktop/I4517.hg19_23andMe_V5.txt: No such file or directory dos2unix: Skipping /Users/jeremyk/Desktop/I4517.hg19_23andMe_V5.txt, not a regular file. zip warning: name not matched: /Users/jeremyk/Desktop/I4517.hg19_23andMe_V5.txt zip error: Nothing to do! (/Users/jeremyk/Desktop/I4517.hg19_23andMe_V5.zip) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe. [ { "@context": "http://schema.org", "@type": "EmailMessage", "potentialAction": { "@type": "ViewAction", "target": "https://github.com/WGSExtract/WGSExtract-Dev/issues/3#issuecomment-656202937", "url": "https://github.com/WGSExtract/WGSExtract-Dev/issues/3#issuecomment-656202937", "name": "View Issue" }, "description": "View this Issue on GitHub", "publisher": { "@type": "Organization", "name": "GitHub", "url": "https://github.com" } } ]
jeremyk@Jeremys-MBP WGSExtract % cd ~/Desktop/WGSExtract/reference_genomes jeremyk@Jeremys-MBP reference_genomes % ls -l total 8184 -rwx------ 1 jeremyk staff 33956 Feb 12 21:25 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.dict -rwx------ 1 jeremyk staff 7804 Dec 23 2019 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.fai -rwx------ 1 jeremyk staff 770648 Dec 23 2019 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.gzi -rwx------ 1 jeremyk staff 3162 Feb 12 21:23 hg19.dict -rwx------ 1 jeremyk staff 783 Dec 23 2019 hg19.fa.gz.fai -rwx------ 1 jeremyk staff 771400 Dec 23 2019 hg19.fa.gz.gzi -rwx------ 1 jeremyk staff 81573 Feb 12 21:22 hg38.dict -rwx------ 1 jeremyk staff 25606 Dec 23 2019 hg38.fa.gz.fai -rwx------ 1 jeremyk staff 814344 Dec 23 2019 hg38.fa.gz.gzi -rwx------ 1 jeremyk staff 11092 Feb 12 21:21 hs37d5.dict -rwx------ 1 jeremyk staff 2813 Dec 23 2019 hs37d5.fa.gz.fai -rwx------ 1 jeremyk staff 781800 Dec 23 2019 hs37d5.fa.gz.gzi -rwx------ 1 jeremyk staff 11673 Feb 12 21:19 human_g1k_v37.dict -rwx------ 1 jeremyk staff 2746 Dec 23 2019 human_g1k_v37.fasta.gz.fai -rwx------ 1 jeremyk staff 772920 Dec 23 2019 human_g1k_v37.fasta.gz.gzi
Couldn't figure out the command for Samtools executable so I got this by clicking on the executable:
Program: samtools (Tools for alignments in the SAM format) Version: 1.10 (using htslib 1.10.2)
Usage: samtools
Commands: -- Indexing dict create a sequence dictionary file faidx index/extract FASTA fqidx index/extract FASTQ index index alignment
-- Editing calmd recalculate MD/NM tags and '=' bases fixmate fix mate information reheader replace BAM header targetcut cut fosmid regions (for fosmid pool only) addreplacerg adds or replaces RG tags markdup mark duplicates
-- File operations collate shuffle and group alignments by name cat concatenate BAMs merge merge sorted alignments mpileup multi-way pileup sort sort alignment file split splits a file by read group quickcheck quickly check if SAM/BAM/CRAM file appears intact fastq converts a BAM to a FASTQ fasta converts a BAM to a FASTA
-- Statistics bedcov read depth per BED region coverage alignment depth and percent coverage depth compute the depth flagstat simple stats idxstats BAM index stats phase phase heterozygotes stats generate stats (former bamcheck)
-- Viewing flags explain BAM flags tview text alignment viewer view SAM<->BAM<->CRAM conversion depad convert padded BAM to unpadded BAM
After downloading the reference file suggested above and putting it into reference_genomes, the issue seemed to resolve itself, thank you!
The index files are there but not the actual reference genomes. For any of the reference genomes.. Don't know how it was deleted from your release. A reinstall would have brought it back as well. Not sure how it could have disappeared. The tool only reads from that folder. Each reference genome is almost 1 GB. Did you delete them to save space?
Randy (from phone, DYACs involved)
On July 10, 2020 2:18:47 PM jkotl0327 notifications@github.com wrote:
jeremyk@Jeremys-MBP WGSExtract % cd ~/Desktop/WGSExtract/reference_genomes jeremyk@Jeremys-MBP reference_genomes % ls -l total 8184 -rwx------ 1 jeremyk staff 33956 Feb 12 21:25 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.dict -rwx------ 1 jeremyk staff 7804 Dec 23 2019 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.fai -rwx------ 1 jeremyk staff 770648 Dec 23 2019 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.gzi -rwx------ 1 jeremyk staff 3162 Feb 12 21:23 hg19.dict -rwx------ 1 jeremyk staff 783 Dec 23 2019 hg19.fa.gz.fai -rwx------ 1 jeremyk staff 771400 Dec 23 2019 hg19.fa.gz.gzi -rwx------ 1 jeremyk staff 81573 Feb 12 21:22 hg38.dict -rwx------ 1 jeremyk staff 25606 Dec 23 2019 hg38.fa.gz.fai -rwx------ 1 jeremyk staff 814344 Dec 23 2019 hg38.fa.gz.gzi -rwx------ 1 jeremyk staff 11092 Feb 12 21:21 hs37d5.dict -rwx------ 1 jeremyk staff 2813 Dec 23 2019 hs37d5.fa.gz.fai -rwx------ 1 jeremyk staff 781800 Dec 23 2019 hs37d5.fa.gz.gzi -rwx------ 1 jeremyk staff 11673 Feb 12 21:19 human_g1k_v37.dict -rwx------ 1 jeremyk staff 2746 Dec 23 2019 human_g1k_v37.fasta.gz.fai -rwx------ 1 jeremyk staff 772920 Dec 23 2019 human_g1k_v37.fasta.gz.gzi Couldn't figure out the command for Samtools executable so I got this by clicking on the executable: Program: samtools (Tools for alignments in the SAM format) Version: 1.10 (using htslib 1.10.2)Usage: samtools [options]
Thank you and teepean for the help -- not sure why it was deleted either, I definitely didn't touch anything in that folder.
Hi Randy, This is Jeremy again. I've been using WGSExtract over the past few weeks and I've come across an issue. The program has been handling some of the low-quality BAMs I've given it pretty well so far. For some reason when I went through file generation yesterday (23andme, living dna, myheritage, etc) all of the files came out as blank text documents. I tried a FASTA file as well but no file was generated at all. Since I've gotten files from BAMs even worse than this one, I started thinking that maybe the quality of the BAM isn't the problem. Have you ever come across this issue before, where the files that come out are completely blank? I even tried redownloading the BAM and everything.
Thanks, Jeremy