churchill-lab / g2gtools

Personal diploid genome creation and coordinate conversion
http://churchill-lab.github.io/g2gtools
21 stars 9 forks source link

Is fastfa file required for vcf2vci step? #25

Closed rituparna-13 closed 2 years ago

rituparna-13 commented 2 years ago

I executed the following command to convert vcf to vci following the documentation: g2gtools vcf2vci -o DEMO1_NEXTERAlib.vci -s DEMO1_NEXTERAlibrary_130508_8 --diploid -p 8 -i DEMO1_NEXTERAlib.vcf

I wanted to test for a single vcf file first because of which I have passed a single vcf file as input. I get the following output after executing the above command:

[g2gtools] VCF file: /gpfs/research/fangroup/rk18g/EGA_UCF1014_datasets/bulk_RNA/DEMO1_NEXTERAlib/DEMO1_NEXTERAlib.vcf [g2gtools] Checking for index file, creating if needed... No fasta file was specified.

From the documentation it doesn't seem that we need any fasta file for this first step. Can you please help me in understanding if I am doing anything wrong?

I am using g2gtools version 0.2.7 and executing it from CentOS.

mattjvincent commented 2 years ago

You need need to supply a reference Fasta file in the first step as well.

   Create VCI file from VCF file(s)

    Usage: g2gtools vcf2vci [-i <indel VCF file>]* -s <strain> -o <output file> [options]

    Required Parameters:
        -i, --vcf <vcf_file>             VCF file name
        -f, --fasta <Fasta File>         Fasta file matching VCF information
        -s, --strain <Strain>            Name of strain (column in VCF file)
        -o, --output <Output file>       VCI file name to create

    Optional Parameters:
        -p, --num-processes <number>     The number of processes to use, defaults to the number of cores
        --diploid                        Create diploid VCI
        --keep                           Keep track of VCF lines that could not be converted to VCI file
        --pass                           Use only VCF lines that have a PASS for the filter value
        --quality                        Filter on quality, FI=PASS
        --no-bgzip                       DO NOT compress and index output

    Help Parameters:
        -h, --help                       Print the help and exit
        -d, --debug                      Turn debugging mode on (list multiple times for more messages)
rituparna-13 commented 2 years ago

Thank you, Matt for the information and a quick response. I was able to execute the command for conversion.

dhadsell commented 2 years ago

I was having the same issue until I found this thread. Could you please clarify one thing for me?

is the fasta file a "reference genome" or is it from the sample for which the .vcf is being input?

rituparna-13 commented 2 years ago

Hi @dhadsell a fasta file is a reference genome. For example, the reference fasta of hg38 or hg19 depending on which one you use to get your bam and vcf files.

dhadsell commented 2 years ago

Thank you for the clarification.

I was using the mm38 reference for mouse (Mus_musculus.GRCm38.dna.primary_assembly.fa) in conjunction with a .vcf for the inbred mouse strain PL/J. The concerned was that I needed an actual .fa for PL_J. What I am trying to do is build a pseudo genome for PL/J and eventually a pooled transcriptome for C57BL/6J x PL/J. I have gone back and forth between the documentation for g2gtools and another package call EMASE.

So using the GRCm38.fa with a PL_J.vcf and a GRCm38.gtf I have used "vcf2vci", "patched", "transform", and "convert" to create the files "PL_J.transformed.fa" and "PL_J.gtf"

right now I am concerned about this approach because I also got a "PL_J.gtf.unmapped" with about 29,000 entries. My next step was going to be in EMASE using:

create-hybrid -G ${GRCm38.fa},${PL_J.transformed.fa} -g ${GRCm38.gtf},${PL_J.gtf} \ -s ${SUFFIX1},${SUFFIX2} -o ${EMASE_DIR}

I am just concerned that my PL_J.transformed.fa may not be the right one.

Am I on the right track here or is there something I have missed?

rituparna-13 commented 2 years ago

I don’t have much experience to comment on this as I am still learning. Like you I too had struggled between g2gtools and EMASE and finally found that I need g2gtools first to get a diploid genome which can be used in EMASE. I think my problem is different from yours and not in my expertise. You can try executing your command to check if it works or post your issue in EMASE's gitHub page (https://github.com/churchill-lab/emase/issues). Hope this helps.