WGSExtract / WGSExtract-Dev

WGS Extract Developers Repository
GNU General Public License v3.0
19 stars 7 forks source link

impossible to select the correct reference genome #14

Closed calledit closed 2 months ago

calledit commented 2 months ago

I just created my Bam file using GRCh38 patch 14 (the latest GRCh38) https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz

The issue I am having is that WGSExtract won't allow me to select that reference genome, it will only allow me to select one of the "T2T" references or "unknown".

If I select "unknown" the program crashes, if i select one of the "T2T" references i can't export the "microarray files".

RandyHarr commented 2 months ago

It is possible we have not added support for patch 14 everywhere. Where is it you are trying to select that reference genome? What version of the software are you using and on what platform?

In general, most tools are not patch aware. So using the patched version does not help. Most DB's like dbSNP which defines rsIDs are not patch aware either. Patches do not change the coordinates of the main reference model. But does offer "fixed" sequences to change the values at particular coordinate positions. Alignment models (such as the hsxxx ones from 1KGenome project that BWA was built from) can only usually utilize fixed patch sequences if the original (main) model has the corresponding sections changed to lower case. The hsxxx models have not been updated to include the patched version sequences or case conversion since 2013. I am not aware of any variant caller that uses patch sequences beyond what are in the hsxxx models. But GATK, DeepVariant, etc may be using them if found.

calledit commented 2 months ago

Ubuntu 24.04 Version: 44.4

First I press the button to open the BAM file. Then a new window opens with a bunch of buttons for various references. One of the genomes in the list is GRCh38 but that button is disabled.

I am not familiar with the code base, but based on this comment and from what I can gauge in this file: https://github.com/WGSExtract/WGSExtract-Dev/blob/6ecae9598bfff6456b8cffd113f72fbac0b31602/program/bamfiles.py#L720 It seams there are various heuristics that are used to determine which reference genoms it could and could not be. My guess is that those heuristics are to narrow in some way

RandyHarr commented 2 months ago

Realize that the checked in code is very old compared to what you get in the release. The release includes all source files.

We, at minimum, strive to open and load any BAM / CRAM so you can, at minimum, unalign it back to FASTQs so it can be realigned to something recognizable. Second, we try and report minimal stats even if a reference model cannot be recognized. It sounds like we are failing at that.

[Edited to capture Ubuntu 24.04 output from WGSE v4.44p4] Please turn DEBUG on and look for lines similar to the below in the log window:

DEBUG: Set BAM/CRAM/SAM (request): /mnt/d/DNA/WGS/RandyYseq/22731_bwa-mem_rCRS_chrM.bam DEBUG: Starting command: /bin/bash -x /home/randy/WGSExtractv4/temp/1140/GetBAMHeader.sh DEBUG: In Please Wait: Extracting BAM Header, Expected Wait: 2 seconds, Start --- STARTING: GetBAMHeader.sh at Sat Jun 1 14:35:17 2024 .... --- FINISHED: GetBAMHeader.sh at Sat Jun 1 14:35:17 2024 ( 0 seconds to run) DEBUG: Bam Coord Sorted? True DEBUG: SN: Chr, M; SN#:455 DEBUG: Build: 38 DEBUG: Ref Genome Mito: rCRS DEBUG: Ref Genome: hg38, by New nomenclature: hg38g DEBUG: Ref Genome File: /home/randy/WGSExtractv4/reference/genomes/hg38.fa.gz

Let us see what that indicates.

calledit commented 2 months ago

So this is what happens when I pick the "unknown" reference:

The preferred way for me would be if there was an option where the program would simply ask for the reference file.

--- STARTING: GetBAMHeader.sh        at Sun Jun  2 11:53:51 2024
+ samtools view -H --no-PG /home/me/Downloads/GFX567888_SA_L001_sorted.bam
--- FINISHED: GetBAMHeader.sh        at Sun Jun  2 11:53:51 2024 (  0 seconds to run)
DEBUG: Bam Coord Sorted? True
DEBUG: SN: Acc, M; SN#:705
DEBUG: Build:  38
DEBUG: Ref Genome Mito: rCRS
DEBUG: Ref Genome: None, by New nomenclature: None
DEBUG: Ref Genome (User): None
Exception in Tkinter callback
Traceback (most recent call last):
  File "/usr/lib/python3.12/tkinter/__init__.py", line 1967, in __call__
    return self.func(*args)
           ^^^^^^^^^^^^^^^^
  File "/home/me/Downloads/WGSExtract-Dev_latest_installer/WGSExtractv4/program/mainwindow.py", line 407, in button_select_BAM_file
    set_BAM_file(BAM_FN)
  File "/home/me/Downloads/WGSExtract-Dev_latest_installer/WGSExtractv4/program/mainwindow.py", line 447, in set_BAM_file
    wgse.BAM = BAMFile(BAM_FN)      # Process the BAM by opening file, reading header and setting up Class instance
               ^^^^^^^^^^^^^^^
  File "/home/me/Downloads/WGSExtract-Dev_latest_installer/WGSExtractv4/program/bamfiles.py", line 132, in __init__
    self._setup_BAM(BAM_FN)
  File "/home/me/Downloads/WGSExtract-Dev_latest_installer/WGSExtractv4/program/bamfiles.py", line 168, in _setup_BAM
    self.process_bam_header()       # Quick but may have error exceptions
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/Downloads/WGSExtract-Dev_latest_installer/WGSExtractv4/program/bamfiles.py", line 214, in process_bam_header
    self.determine_reference_genome()       # Only need header available to determine reference_genome
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/Downloads/WGSExtract-Dev_latest_installer/WGSExtractv4/program/bamfiles.py", line 1095, in determine_reference_genome
    tempFN = unquote(self.Refgenome_qFN)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/Downloads/WGSExtract-Dev_latest_installer/WGSExtractv4/program/utilities.py", line 109, in unquote
    return path.replace('"', '')
           ^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'replace'
RandyHarr commented 2 months ago

No matter what, the program should not exception out like that. Odd that it exists in the code. Was resolved earlier. I can give you a patch to get around this exception in the current code. See the end here.

It may still not behave well given the accession nomeclature. When non "chrN" or "N" naming is used, we have to create custom files throughout the reference/ folder and additional mappings throughout the code. The tools only work well with accession names if all the other dependencies also understand the names. But there are thousands of variations in the accession names. Too many to enumerate. Not even statistics gathering will work with accession names without a full map of them to "standard" names.. If I recall, only unalign will work with accession names and not having the reference genome specified as well.

Seems you may be comfortable editing code. Note: exactly as shown leading spaces are important in Python. Options are: utilities.py, line 109: Change to return path.replace('"', '') if path or bamfiles.py, line 1093: Change to tempFN = unquote(self.Refgenome_qFN) if self.Refgenome_qFN else ""

calledit commented 2 months ago

Names for all accessions and mappings are available on the NCBI website. In the *_assembly_report.txt files that are available for each reference.

These files: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_assembly_report.txt

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.27_GRCh38.p1/GCF_000001405.27_GRCh38.p1_assembly_report.txt

And so on.

Anyway, thanks for your time!

calledit commented 2 months ago

Made a script that pulls all the accessions for all full genome references of Homo sapiens that are registered at NCBI:

#!/bin/bash
curl -O https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank_historical.txt
curl -O https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq_historical.txt

#cat assembly_summary_refseq_historical.txt assembly_summary_genbank_historical.txt |grep Homo|grep Full|cut -f 1,16
#first field is assembly_accession second is asm_name

#Urls to retrive assension info report is:
#echo https://ftp.ncbi.nlm.nih.gov/genomes/all/${assembly_accession:0:3}/${assembly_accession:4:3}/${assembly_accession:7:3}/${assembly_accession:10:3}/${assembly_accession}_${asm_name}/${assembly_accession}_${asm_name}_assembly_report.txt

#loop throgh URLS and download them
cat assembly_summary_refseq_historical.txt assembly_summary_genbank_historical.txt |grep Homo|grep Full|cut -f 1,16|while read assembly_accession asm_name
do
    curl -O https://ftp.ncbi.nlm.nih.gov/genomes/all/${assembly_accession:0:3}/${assembly_accession:4:3}/${assembly_accession:7:3}/${assembly_accession:10:3}/${assembly_accession}_${asm_name}/${assembly_accession}_${asm_name}_assembly_report.txt
done
RandyHarr commented 2 months ago

I long ago explored all the possible human reference genomes that we were likely to come across. See https://bit.ly/34CO0vj. Unfortunately we still find Ancient Ancestry university researchers that have created custom ones. We generally never find BAMs based on the actual GenBank or RefSeq checked in files.

A focus going forward will be more along the lines of piggy-backing on the EBI CRAM-file M5 sequencer server. Building custom models on the fly from what we can determine from the SAM/BAM/CRAM header. But even that is a challenge as M5 signatures are rare to find in BAM files; unfortunately. So the likelihood of guessing wrong is great.

The tool currently can generate the TSVs of all found sequence names. When I run it on the study models of above, there are over 8,000 entries. The issue is getting them to correspond to each other. LN fields are not enough as many are duplicated. Then there is the issue of different major builds having different lengths for the common human genome components of the chromosomes. The HPRC has over 50 Y chromosomes ranging in length from 40 to 85 billion base-pairs now. Finally, the tools tend to be very dependent on the names in the reference and their order. And so all the files in the reference library would have to be recreated for each variation. (Hence, see the above on dynamic creation of reference library files.)

Not sure whether we will come up with a generic solution here before it all gets replaced with pangenome functionality. Once a pangenome model is more developed and incorporated, the problem becomes larger and different as it will likely be continually patched and fixed.