freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License
143 stars 24 forks source link

No output files generated for Illunmina reports #40

Closed rashindrie closed 3 years ago

rashindrie commented 3 years ago

Hello,

I have tried to use the following command to convert Illumina reports to VCF.

bcftools +gtc2vcf --genome-studio FinalReport24.txt -o GenotypeReport24.vcf

Output from the run in the terminal is only one line:

gtc2vcf 2021-06-01 https://github.com/freeseek/gtc2vcf

And the GenotypeReport24.vcf file is created but with no contents in it.

An extract from the Illumina report:

[Header]
GSGT Version    2.0.4
Processing Date 3/29/2021 4:13 PM
Content     GSA-24v3-0_A2.bpm
Num SNPs    654027
Total SNPs  654027
Num Samples 24
Total Samples   24
File    24 of 24
[Data]
Sample Index    Sample ID   Sample Name SNP Index   SNP Name    Chr Position    GT Score    GC Score    Allele1 - AB    Allele2 - AB    Allele1 - Top   Allele2 - Top   Allele1 - Forward   Allele2 - Forward   Allele1 - Design    Allele2 - Design    Theta   R   X Raw   Y Raw   X   Y   B Allele Freq   Log R Ratio SNP Aux SNP ILMN Strand Top Genomic Sequence    Customer Strand
24  03-031      1   1:103380393 1   102914837   0.7987  0.8136  B   B   G   G   G   G   C   C   0.963   0.722   1101    3453    0.040   0.682   1.0000  0.3609  0   [T/C]   BOT     TOP
24  03-031      2   1:109439680 1   108897058   0.8792  0.4803  A   A   A   A   A   A   A   A   0.039   0.895   11409   497 0.843   0.052   0.0000  0.4173  0   [A/G]   TOP     TOP

I spent some hours trying to figure out what i might be doing wrong but couldn't figure it out. Any tips on what might be going wrong with my steps is appreciated.

Thanks, Rashindrie

Update

Tried with below command

ref="/tmp/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
bcftools +gtc2vcf   --no-version -Ov -o $out_prefix  --genome-studio "FinalReport24.txt" -f $ref

Output on terminal

gtc2vcf 2021-06-01 https://github.com/freeseek/gtc2vcf
Writing VCF file
Could not recognize INFO field: [Header]
freeseek commented 3 years ago

Thank you. I will add an error message when trying to input GenomeStudio file without a reference genome. The problem with your GenomeStudio file is that it contains a header. But there is also the problem that your GenomeStudio file is in "long" format, that is, each line contains information for one sample at one marker rather than being in "matrix" format where each line contains information for all samples at one marker. This will not convert to VCF as it cannot be parsed as a stream. An example GenomeStudio file that can be converted looks as follows:

Chromosome      Position        IlmnStrand      SNP     Name    SM.GType        SM.Score        SM.Theta        SM.R    SM.B Allele Freq        SM.Log R Ratio
9       139906359       BOT     [T/C]   200003  AA      0.9299  0.029   1.300   0.0027  0.2150
9       139926402       TOP     [A/G]   200006  AB      0.7877  0.435   2.675   0.4742  0.3024
2       220084902       BOT     [T/C]   200047  AA      0.8612  0.083   0.476   0.0532  -0.1173
2       220089685       TOP     [C/G]   200050  BB      0.8331  0.995   1.499   1.0000  0.3068

Which could have multiple samples by adding more columns. My advice is to retrieve the IDAT (or GTC) files, which are files with a proper format specification, and follow the steps to generate a VCF file from there.

rashindrie commented 3 years ago

Thank you for your reply @freeseek .

As you've suggested, I will try to get the IDAT/GTC files.

rashindrie commented 3 years ago

Hi @freeseek ,

I have a another question, I have generated signal Intensity files using the Illumina Final reports. Would they work as input to generate VCF files?

Name    Chr Position    27-017.Log R Ratio  27-017.B Allele Freq
1:103380393 1   102914837   0.4431  1.0000
1:109439680 1   108897058   0.2632  0.0000
1:110198788 1   109656166   0.0063  0.0131
1:110201112 1   109658490   -0.4072 0.0000
1:110201667 1   109659045   -0.2497 1.0000
1:110202904 1   109660282   -0.0134 0.0000
1:110203240 1   109660618   -0.0141 0.9975
1:110203911 1   109661289   -0.0750 1.0000
1:110206675 1   109664053   -0.1057 0.0086

Thanks, Rashindrie

freeseek commented 3 years ago

Almost. That table in theory could convert fine into a VCF with a single sample. It is missing the required GType column though and it is impossible to infer what the alternate allele should be at each position from those columns alone. Furthermore, I would not advise to convert data into single sample VCFs though, as it is computationally expensive to then merge single sample VCFs. Finally, one of the reasons I advise against converting data from GenomeStudio tables is that it is not possible to use +gtc2vcf infrastructure to align against the human genome reference of choice (many GenomeStudio tables have GRCh37 coordinates and GRCh38 should be the reference of choice at the moment).

rashindrie commented 3 years ago

Thanks for your reply. I'll look into what I can do :)

rashindrie commented 3 years ago

Hi @freeseek ,

I have another question. I was able to obtain the IDAT files but I am now getting a "Segmentation Fault" error.

Command:

bpm_manifest_file="/home/ubuntu/snp_analysis/data/24v3-0_A2.bpm"
csv_manifest_file="/home/ubuntu/snp_analysis/data/SampleSheet.csv"
egt_cluster_file="/home/ubuntu/snp_analysis/data/24v3-0_A1_ClusterFile.egt"
ref="/tmp/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"

bcftools +gtc2vcf --bpm $bpm_manifest_file --egt $egt_cluster_file --fasta-ref $ref -i -g $path_to_idat_folder

Output:

gtc2vcf 2021-06-01 https://github.com/freeseek/gtc2vcf
Reading BPM file /home/ubuntu/snp_analysis/data/24v3-0_A2.bpm
Reading EGT file /home/ubuntu/snp_analysis/data/24v3-0_A1_ClusterFile.egt
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R08C02_Red.idat
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R02C02_Red.idat
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R09C01_Red.idat
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R09C01_Grn.idat
Reading IDAT file /home/ubuntu/snp_analysis/data/205058610011//205058610011_R05C02_Red.idat
.....
.....
Writing VCF file
Segmentation fault (core dumped)

Is this caused of a memory issue?

Thanks, Rashindrie

freeseek commented 3 years ago

This is clearly a bug but you should not use the tool like that. bcftools +gtc2vcf can only input IDATs to extract the metadata from them (this can be useful to understand which manifest files to use). When you input IDATs, you should not input anything else (and the tool should have thrown an exception). The proper use of the tool should be as follows:

bcftools +gtc2vcf --bpm $bpm_manifest_file --egt $egt_cluster_file --fasta-ref $ref -g $path_to_gtc_folder

(with optionally --csv $csv_manifest_file to also allow indels to be properly encoded)

rashindrie commented 3 years ago

Oh I see.. Thanks for pointing that out.

So following your last comment, I changed the workflow.

Now, I am using the IDAT files to first create the GTC files through the following command:

LANG="en_US.UTF-8" $HOME/bin/iaap-cli/iaap-cli gencall   $bpm_manifest_file   $egt_cluster_file   $path_to_output_folder   --idat-folder $path_to_idat_folder   --output-gtc   --gender-estimate-call-rate-threshold -0.1

Then I used the GTC files to get VCF files as mentioned in the Readme.

out_prefix="/home/ubuntu/snp_analysis/vcf/ILGSA"
bcftools +gtc2vcf \
  --no-version -Ou \
  --bpm $bpm_manifest_file \
  --csv $csv_manifest_file \
  --egt $egt_cluster_file \
  --gtcs $path_to_gtc_folder \
  --fasta-ref $ref \
  --extra $out_prefix.tsv | \
  bcftools sort -Ou -T ./bcftools-sort.XXXXXX | \
  bcftools norm --no-version -Ob -o $out_prefix.bcf -c x -f $ref && \
  bcftools index -f $out_prefix.bcf

The .bcf and .tsv files are now created!

rashindrie commented 3 years ago

Hi @freeseek ,

I am now trying to use the Mocha pipeline with the generated BCF file.

However, I am not entirely clear as to if I should generate one BCF file as I have done through the above command or create individual VCF files for each GTC file to use in mocha.

Could you please point me in the right direction.

Thanks, Rashindrie

freeseek commented 3 years ago

Generating single sample BCF files is not a good idea. There is nothing wrong in principle, but it is a huge waste of computational resources as it causes conversion computations to be repeated. Furthermore it is computationally intensive to merge multiple single sample BCFs. It just makes way more sense to convert multiple GTC files at once into one BCF file with multiple samples.

rashindrie commented 3 years ago

So, I am confused.

When I give --gtcs $path_to_gtc_folder into the bcftools +gtc2vcf command, the BCF file generated is converting multiple GTC files into one BCF sample right?

freeseek commented 3 years ago

Yes, when you use --gtcs $path_to_gtc_folder you end up converting multiple GTCs into a single VCF/BCF file with one sample for each GTC file you input.

rashindrie commented 3 years ago

Thanks, can I use the BCF file generated in such a manner directly in Mocha pipeline?

freeseek commented 3 years ago

In theory you can use the IDATs as input in the MoChA pipeline. If your question is whether you can use the VCF output from gtc2vcf as input to the BCFtools MoChA plugin, then no, as the VCF would first need to be phased. Follow the instructions in the documentation to know what you need to do.

rashindrie commented 3 years ago

Oh I was not aware that we can directly use IDAT on the MochA pipeline. I'll check the docs again.