Closed rashindrie closed 3 years ago
Thank you. I will add an error message when trying to input GenomeStudio file without a reference genome. The problem with your GenomeStudio file is that it contains a header. But there is also the problem that your GenomeStudio file is in "long" format, that is, each line contains information for one sample at one marker rather than being in "matrix" format where each line contains information for all samples at one marker. This will not convert to VCF as it cannot be parsed as a stream. An example GenomeStudio file that can be converted looks as follows:
Chromosome Position IlmnStrand SNP Name SM.GType SM.Score SM.Theta SM.R SM.B Allele Freq SM.Log R Ratio
9 139906359 BOT [T/C] 200003 AA 0.9299 0.029 1.300 0.0027 0.2150
9 139926402 TOP [A/G] 200006 AB 0.7877 0.435 2.675 0.4742 0.3024
2 220084902 BOT [T/C] 200047 AA 0.8612 0.083 0.476 0.0532 -0.1173
2 220089685 TOP [C/G] 200050 BB 0.8331 0.995 1.499 1.0000 0.3068
Which could have multiple samples by adding more columns. My advice is to retrieve the IDAT (or GTC) files, which are files with a proper format specification, and follow the steps to generate a VCF file from there.
Thank you for your reply @freeseek .
As you've suggested, I will try to get the IDAT/GTC files.
Hi @freeseek ,
I have a another question, I have generated signal Intensity files using the Illumina Final reports. Would they work as input to generate VCF files?
Name Chr Position 27-017.Log R Ratio 27-017.B Allele Freq
1:103380393 1 102914837 0.4431 1.0000
1:109439680 1 108897058 0.2632 0.0000
1:110198788 1 109656166 0.0063 0.0131
1:110201112 1 109658490 -0.4072 0.0000
1:110201667 1 109659045 -0.2497 1.0000
1:110202904 1 109660282 -0.0134 0.0000
1:110203240 1 109660618 -0.0141 0.9975
1:110203911 1 109661289 -0.0750 1.0000
1:110206675 1 109664053 -0.1057 0.0086
Thanks, Rashindrie
Almost. That table in theory could convert fine into a VCF with a single sample. It is missing the required GType column though and it is impossible to infer what the alternate allele should be at each position from those columns alone. Furthermore, I would not advise to convert data into single sample VCFs though, as it is computationally expensive to then merge single sample VCFs. Finally, one of the reasons I advise against converting data from GenomeStudio tables is that it is not possible to use +gtc2vcf infrastructure to align against the human genome reference of choice (many GenomeStudio tables have GRCh37 coordinates and GRCh38 should be the reference of choice at the moment).
Thanks for your reply. I'll look into what I can do :)
Hi @freeseek ,
I have another question. I was able to obtain the IDAT files but I am now getting a "Segmentation Fault" error.
Command:
bpm_manifest_file="/home/ubuntu/snp_analysis/data/24v3-0_A2.bpm"
csv_manifest_file="/home/ubuntu/snp_analysis/data/SampleSheet.csv"
egt_cluster_file="/home/ubuntu/snp_analysis/data/24v3-0_A1_ClusterFile.egt"
ref="/tmp/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
bcftools +gtc2vcf --bpm $bpm_manifest_file --egt $egt_cluster_file --fasta-ref $ref -i -g $path_to_idat_folder
Output:
gtc2vcf 2021-06-01 https://github.com/freeseek/gtc2vcf
Reading BPM file /home/ubuntu/snp_analysis/data/24v3-0_A2.bpm
Reading EGT file /home/ubuntu/snp_analysis/data/24v3-0_A1_ClusterFile.egt
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R08C02_Red.idat
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R02C02_Red.idat
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R09C01_Red.idat
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R09C01_Grn.idat
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R05C02_Red.idat
.....
.....
Writing VCF file
Segmentation fault (core dumped)
Is this caused of a memory issue?
Thanks, Rashindrie
This is clearly a bug but you should not use the tool like that. bcftools +gtc2vcf
can only input IDATs to extract the metadata from them (this can be useful to understand which manifest files to use). When you input IDATs, you should not input anything else (and the tool should have thrown an exception). The proper use of the tool should be as follows:
bcftools +gtc2vcf --bpm $bpm_manifest_file --egt $egt_cluster_file --fasta-ref $ref -g $path_to_gtc_folder
(with optionally --csv $csv_manifest_file
to also allow indels to be properly encoded)
Oh I see.. Thanks for pointing that out.
So following your last comment, I changed the workflow.
Now, I am using the IDAT files to first create the GTC files through the following command:
LANG="en_US.UTF-8" $HOME/bin/iaap-cli/iaap-cli gencall $bpm_manifest_file $egt_cluster_file $path_to_output_folder --idat-folder $path_to_idat_folder --output-gtc --gender-estimate-call-rate-threshold -0.1
Then I used the GTC files to get VCF files as mentioned in the Readme.
out_prefix="/home/ubuntu/snp_analysis/vcf/ILGSA"
bcftools +gtc2vcf \
--no-version -Ou \
--bpm $bpm_manifest_file \
--csv $csv_manifest_file \
--egt $egt_cluster_file \
--gtcs $path_to_gtc_folder \
--fasta-ref $ref \
--extra $out_prefix.tsv | \
bcftools sort -Ou -T ./bcftools-sort.XXXXXX | \
bcftools norm --no-version -Ob -o $out_prefix.bcf -c x -f $ref && \
bcftools index -f $out_prefix.bcf
The .bcf
and .tsv
files are now created!
Hi @freeseek ,
I am now trying to use the Mocha pipeline with the generated BCF file.
However, I am not entirely clear as to if I should generate one BCF file as I have done through the above command or create individual VCF files for each GTC file to use in mocha.
Could you please point me in the right direction.
Thanks, Rashindrie
Generating single sample BCF files is not a good idea. There is nothing wrong in principle, but it is a huge waste of computational resources as it causes conversion computations to be repeated. Furthermore it is computationally intensive to merge multiple single sample BCFs. It just makes way more sense to convert multiple GTC files at once into one BCF file with multiple samples.
So, I am confused.
When I give --gtcs $path_to_gtc_folder
into the bcftools +gtc2vcf
command, the BCF file generated is converting multiple GTC files into one BCF sample right?
Yes, when you use --gtcs $path_to_gtc_folder
you end up converting multiple GTCs into a single VCF/BCF file with one sample for each GTC file you input.
Thanks, can I use the BCF file generated in such a manner directly in Mocha pipeline?
In theory you can use the IDATs as input in the MoChA pipeline. If your question is whether you can use the VCF output from gtc2vcf as input to the BCFtools MoChA plugin, then no, as the VCF would first need to be phased. Follow the instructions in the documentation to know what you need to do.
Oh I was not aware that we can directly use IDAT on the MochA pipeline. I'll check the docs again.
Hello,
I have tried to use the following command to convert Illumina reports to VCF.
Output from the run in the terminal is only one line:
And the
GenotypeReport24.vcf
file is created but with no contents in it.An extract from the Illumina report:
I spent some hours trying to figure out what i might be doing wrong but couldn't figure it out. Any tips on what might be going wrong with my steps is appreciated.
Thanks, Rashindrie
Update
Tried with below command
Output on terminal