Closed rituparna-13 closed 2 years ago
You need need to supply a reference Fasta file in the first step as well.
Create VCI file from VCF file(s)
Usage: g2gtools vcf2vci [-i <indel VCF file>]* -s <strain> -o <output file> [options]
Required Parameters:
-i, --vcf <vcf_file> VCF file name
-f, --fasta <Fasta File> Fasta file matching VCF information
-s, --strain <Strain> Name of strain (column in VCF file)
-o, --output <Output file> VCI file name to create
Optional Parameters:
-p, --num-processes <number> The number of processes to use, defaults to the number of cores
--diploid Create diploid VCI
--keep Keep track of VCF lines that could not be converted to VCI file
--pass Use only VCF lines that have a PASS for the filter value
--quality Filter on quality, FI=PASS
--no-bgzip DO NOT compress and index output
Help Parameters:
-h, --help Print the help and exit
-d, --debug Turn debugging mode on (list multiple times for more messages)
Thank you, Matt for the information and a quick response. I was able to execute the command for conversion.
I was having the same issue until I found this thread. Could you please clarify one thing for me?
is the fasta file a "reference genome" or is it from the sample for which the .vcf is being input?
Hi @dhadsell a fasta file is a reference genome. For example, the reference fasta of hg38 or hg19 depending on which one you use to get your bam and vcf files.
Thank you for the clarification.
I was using the mm38 reference for mouse (Mus_musculus.GRCm38.dna.primary_assembly.fa) in conjunction with a .vcf for the inbred mouse strain PL/J. The concerned was that I needed an actual .fa for PL_J. What I am trying to do is build a pseudo genome for PL/J and eventually a pooled transcriptome for C57BL/6J x PL/J. I have gone back and forth between the documentation for g2gtools and another package call EMASE.
So using the GRCm38.fa with a PL_J.vcf and a GRCm38.gtf I have used "vcf2vci", "patched", "transform", and "convert" to create the files "PL_J.transformed.fa" and "PL_J.gtf"
right now I am concerned about this approach because I also got a "PL_J.gtf.unmapped" with about 29,000 entries. My next step was going to be in EMASE using:
create-hybrid -G ${GRCm38.fa},${PL_J.transformed.fa} -g ${GRCm38.gtf},${PL_J.gtf} \ -s ${SUFFIX1},${SUFFIX2} -o ${EMASE_DIR}
I am just concerned that my PL_J.transformed.fa may not be the right one.
Am I on the right track here or is there something I have missed?
I don’t have much experience to comment on this as I am still learning. Like you I too had struggled between g2gtools and EMASE and finally found that I need g2gtools first to get a diploid genome which can be used in EMASE. I think my problem is different from yours and not in my expertise. You can try executing your command to check if it works or post your issue in EMASE's gitHub page (https://github.com/churchill-lab/emase/issues). Hope this helps.
I executed the following command to convert vcf to vci following the documentation:
g2gtools vcf2vci -o DEMO1_NEXTERAlib.vci -s DEMO1_NEXTERAlibrary_130508_8 --diploid -p 8 -i DEMO1_NEXTERAlib.vcf
I wanted to test for a single vcf file first because of which I have passed a single vcf file as input. I get the following output after executing the above command:
[g2gtools] VCF file: /gpfs/research/fangroup/rk18g/EGA_UCF1014_datasets/bulk_RNA/DEMO1_NEXTERAlib/DEMO1_NEXTERAlib.vcf [g2gtools] Checking for index file, creating if needed... No fasta file was specified.
From the documentation it doesn't seem that we need any fasta file for this first step. Can you please help me in understanding if I am doing anything wrong?
I am using g2gtools version 0.2.7 and executing it from CentOS.