WGSExtract / WGSExtract.github.io

WGS Extract WWW home
https://WGSExtract.github.io/
GNU General Public License v3.0

Command-line tool #19

Closed Bioinf-usr closed 10 months ago

Bioinf-usr commented 1 year ago

Hi,

Thank you,

Look forward to your response.

RandyHarr commented 1 year ago

(1) Command-line tool: We simply have not found the time to add a command-line option. It is one of dozens of items in the manual's section on future ideas. Unfortunately, improving and expanding what already works is consuming all of the current volunteer coders' time. We always welcome additional coders. This would require Python coding.

We started the code to add "parameters" on the command line, to at least allow people to associate file extensions with the tool (e.g. click on a BAM file and it brings up the tool with that file loaded). But adding parameters for all the buttons and their possible options is much more work.

(2) Creating a microarray file from a VCF is just not accurate.
You need, at minimum, a gVCF file to generate a full microarray result. The reason is that 80-90% of the values in a microarray result are homozygous for the reference, and VCFs only indicate variance from the reference. You cannot assume from a VCF alone that a position was tested and found homozygous for the reference.

A gVCF will give regions (blocks) where the base pairs were tested and homozygous to the reference. Once we start generating gVCF files, we can consider adding microarray results from them, because we will then know when to insert the reference value and when to insert a -/0 to represent a no call in the microarray file. Those who create a microarray from a VCF are simply putting the reference value in every missing location. While this will work for the majority of the locations missing in the VCF, it is not very accurate.
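To make the difference concrete, here is a minimal Python sketch of the decision for a single microarray position. It is not code from WGS Extract; the function name, the `--` no-call symbol, and the data layout (a dict of variant calls plus a sorted list of reference blocks for one chromosome) are all assumptions chosen for illustration.

```python
from bisect import bisect_right

NO_CALL = "--"  # assumed placeholder; the real no-call symbol depends on the vendor format


def genotype_at(pos, ref_base, variants, ref_blocks):
    """Return a two-letter genotype for one target position on one chromosome.

    variants:   dict mapping position -> genotype string, e.g. {1050: "AG"}
    ref_blocks: sorted, non-overlapping list of (start, end) inclusive ranges
                reported as tested and homozygous for the reference
    """
    if pos in variants:
        return variants[pos]
    # Locate the last block whose start is <= pos, then check that pos is inside it.
    i = bisect_right(ref_blocks, (pos, float("inf"))) - 1
    if i >= 0 and ref_blocks[i][0] <= pos <= ref_blocks[i][1]:
        return ref_base * 2
    # Not variant and not covered by a reference block: the site was never confirmed.
    return NO_CALL


variants = {1050: "AG"}
ref_blocks = [(1000, 1049), (1051, 1100)]
print(genotype_at(1020, "C", variants, ref_blocks))  # inside a reference block
print(genotype_at(1050, "A", variants, ref_blocks))  # variant call
print(genotype_at(1200, "T", variants, ref_blocks))  # untested position
```

With only a plain VCF there is no `ref_blocks` list, so position 1200 would have to be guessed as `TT` even though nothing was sequenced there.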

When we create a microarray file from a BAM, we use a VCF template to tell the variant caller to include specific SNPs in the VCF, whether variant from the reference or not. That type of VCF is generally not found or generated. If we find that generating and reading a gVCF is as fast as working from a BAM, we may simply change the tool internally to use this procedure, which would make gVCF-to-microarray just a simpler form of the pipeline than starting from a BAM. A microarray file from a gVCF should be the same as one from the original BAM, as should a microarray file that starts from a consensus diploid FASTA.

A microarray file from a WES gVCF result can also then be considered although the number of no calls will increase considerably.

I will leave this open to remind us of the requests. I encourage others to read here and in the user manual about the many possible extensions that would make the tool even better, and I encourage those who program in Python (or want to learn) to consider tackling one of these additional features.

Bioinf-usr commented 1 year ago

Hi,

Thanks a lot for the explanation.

1) Regarding point 1, I absolutely understand that it requires a considerable amount of time. 2) I agree with you regarding the minimum requirement of a gVCF. We do have access to gVCF files. Which script would be a good starting point to modify so that it takes a gVCF as input and generates the microarray files?

Here you mentioned "When we create a microarray file from a BAM, we use a VCF template to tell the variant caller to include in the VCF specific SNPs; whether variant from the reference or not. That type of VCF is generally not found or generated." Is this VCF generated by you? Can it be downloaded from somewhere? Is it a VCF with all the possible SNP combinations across all the positions in a genome, or is it just based on your SNPs of interest?

Look forward to your response.

Thank you.

RandyHarr commented 1 year ago

I would urge you to join the consumerWGS Facebook group we mention on the home page. There is a lot of discussion there, and it is a better place to pose inquiries.

A VCF traditionally contains only the variants from a reference that were found in the sequencing results. There may be multiple entries covering a site if complex variance exists.

A targeted VCF generates an entry for each entry in a supplied VCF "target" list, whether variant or not. This is the only time you will see the 0/0 specification for the diploid value set in a VCF (assuming you set the ploidy correctly).
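As an illustration of how the GT field of a targeted VCF record maps to a microarray-style genotype, here is a small Python sketch. The function name and the `--` no-call symbol are assumptions for this example, not WGS Extract code, and it only handles single-base alleles.

```python
def gt_to_genotype(ref, alt, gt):
    """Translate VCF REF/ALT/GT fields into a genotype string such as "AG".

    ref: the REF column, e.g. "A"
    alt: the ALT column, e.g. "G", "G,T", or "." when no alternate allele is listed
    gt:  the GT value, e.g. "0/0", "0/1", "1|2", or "./."
    """
    # Allele index 0 is the reference; indexes 1..n are the comma-separated ALT alleles.
    alleles = [ref] + ([] if alt in (".", "") else alt.split(","))
    calls = []
    for idx in gt.replace("|", "/").split("/"):
        if idx == ".":
            return "--"  # assumed no-call symbol for a missing allele
        calls.append(alleles[int(idx)])
    return "".join(calls)


print(gt_to_genotype("A", ".", "0/0"))    # homozygous reference, only seen in a targeted VCF
print(gt_to_genotype("A", "G", "0/1"))    # heterozygous
print(gt_to_genotype("A", "G", "./."))    # no call
```

The `0/0` case is the one an ordinary VCF never supplies, which is why the target list matters.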

A gVCF adds blocks of similar-quality base pairs that match the reference, and so exposes the gaps where there are no sequencing results or only bad ones.
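For anyone wanting to experiment with gVCF input, here is a hedged Python sketch of pulling the reference blocks out of gVCF lines. It assumes the block end is stored as an `END=` key in the INFO column, as commonly emitted gVCFs do; the function name is hypothetical and nothing here is taken from the WGS Extract code base.

```python
def parse_ref_block(line):
    """Return (chrom, start, end) for a gVCF reference-block line, else None."""
    if line.startswith("#"):
        return None  # header line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return None  # not a complete VCF record
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    for item in info.split(";"):
        if item.startswith("END="):
            # POS and END are both 1-based and inclusive.
            return chrom, pos, int(item[4:])
    return None  # ordinary single-site record


block = "chr1\t1000\t.\tC\t<NON_REF>\t.\t.\tEND=1049\tGT:DP\t0/0:30"
snp = "chr1\t1050\t.\tA\tG\t50\tPASS\tDP=30\tGT\t0/1"
print(parse_ref_block(block))
print(parse_ref_block(snp))
```

This sketch ignores the per-block quality fields; a real converter would likely want to treat low-quality or low-depth blocks as no calls rather than as confirmed reference.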

A consensus (diploid) FASTA is, like a reference model, a FASTA file with one entry for each chromosome, the mitochondria, etc. Usually, you do not fill in reference values for unsequenced areas; instead you use N's, -'s or .'s, as the reference FASTA or VCFs would. Such a consensus FASTA has had all quality information stripped away. Some prefer to simply include the reference for any value not shown as variant. But then you run into the same issue as when using a VCF to generate a microarray file.

We are working to generate a gVCF and a diploid consensus FASTA for WGS results. Currently, only a mitochondrial consensus FASTA is generated.

We generate a targeted VCF internally. If you turn DEBUG mode on and look in the temp directory, it is left behind and you will find it there after each run to create the CombinedKit file (when it does not exist). There is a 1:1 correspondence between the targeted VCF and the CombinedKit file.

Going from the BAM to the CombinedKit file is handled in program/microarray.py. In there you will see how it generates a shell script to go from BAM to CombinedKit. The custom shell script is left in the temp folder if DEBUG mode is on.

Each subset for a vendor and version is generated using the program/aconv.py file. But it needs to be refactored / rewritten, and pulled into the program/microarray.py code base. We cannot properly edit it, as is, to generate some of the new targets such as 1240K.