G2P plugin usage with multiple input csv

Stikus commented 5 years ago

Hello. Iв this possible to run single VEP run with with multiple input CSV files? I tried to pass 2 file args to command: $VEP --fork $PROGRAMNUMCPUS --input_file $inputVCFFile --format vcf --cache --dir_cache $VEPCACHEDIR --dir_plugins $VEPPLUGINSDIR --assembly $vepCacheAssembly --offline --output_file $vepG2PVCFFile --vcf --no_stats --force_overwrite --plugin G2P,file='$vepG2PDDCSV',file='$vepG2PCancerCSV',log_dir='$vepG2PLogDir',txt_report='$vepG2PReportTxt',html_report='$vepG2PReportHtml',af_from_vcf=1

Looks like only one working - last in list, $vepG2PCancerCSV in this case. Changing order changes result and result is always same as with only one (last in list) plugin. But iI dont see any errors.

Should I merge input CSV's or there are any other method?

at7 commented 5 years ago

Hello, the plugin can only read one file. Your best option if you want to use several files is to merge them. The plugin requires genotype data as input in your VCF file and you need to specify a list of individuals for the plugin to work: --individual [all|ind list] More help about the individual option can be found here. Best regards, Anja

Stikus commented 5 years ago

Thank you for fast answer, I'll try merging. And about individuals option - my command missed it because I dont see any info about it there and there. Both commands are simple:

./vep -i input.vcf --plugin G2P,file='DDG2P_11_7_2017.csv',log_dir='g2p_log_dir',txt_report='report.txt',html_report='report.html'

perl $HOME/src/ensembl-vep/vep \
--input_file input.vcf \
--output_file output.txt \
--force_overwrite \
--cache \
--assembly GRCh37 \
--port 3337 \
--dir_plugins $HOME/src/VEP_plugins \
--plugin G2P,file=DDG2P.csv,txt_report=report.txt,html_report=report.html,af_from_vcf=1 \

About this option - if my VCF file contain only one person data (sometimes result of germline pipeline on normal sample, sometimes result of somatic pipeline on normal and tumor samples), I should still use individual key? And can i use it without specifying list or all - simply --individual?

at7 commented 5 years ago

I will update the documentation accordingly. Thank you for pointing this out. Yes, you need to specify --individual all or --individual [sample_name].

Stikus commented 5 years ago

Thank you again. Correct me if I'm wrong - even if I have 2 samples (normal and tumor) from one person (if I correctly understand what 'individual' mean here) I still should consired them separately with their respective sample names or all key?

at7 commented 5 years ago

The plugin doesn't differentiate between normal and tumour. Each individual (sample column in a VCF file) is considered separately. The plugin highlights transcripts whose allelic requirement (e.g. biallelic or monoallelic) is fulfilled based on the overlapping genotypes in a single individual. The filtering also considers mutation consequence and allele frequencies. In your case I would recommend to use --individual all. The result could be that transcript_a has a sufficient number of overlapping variants in your tumor sample which could be linked to the observed phenotypes as described in the G2P database. But for your normal sample that transcript_a wouldn't be flagged. Please let me know if you have any more questions.

Stikus commented 5 years ago

And again thank you for detailed answer - I'll try --individual all option.

Stikus commented 5 years ago

Good evening. After your advices I modified run command to this: Merge: cat $vepG2PCancerCSV > $vepG2PMergedCSV && sed '1d' $vepG2PDDCSV >> $vepG2PMergedCSV Run: $VEP --fork $PROGRAMNUMCPUS --individual all --input_file $inputVCFFile --format vcf --cache --dir_cache $VEPCACHEDIR --dir_plugins $VEPPLUGINSDIR --assembly $vepCacheAssembly --offline --output_file $vepG2PVCFFile --vcf --no_stats --force_overwrite --plugin G2P,file='$vepG2PMergedCSV',log_dir='$vepG2PLogDir',txt_report='$vepG2PReportTxt',html_report='$vepG2PReportHtml',af_from_vcf=1.

And now I receive warnings (not every time - that is strange - first run = no warnings, 2nd = 1, 3rd = 2) about Tabix problems with indexes:

WARNING: 328 : Couldn't find index for file ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/UK10K_COHORT.20160215.sites.vcf.gz at /soft/Bio-DB-HTS-3.01/lib/perl/5.18.2/Bio/DB/HTS/Tabix.pm line 53, <__ANONIO__> line 560.
WARNING: 325 : Couldn't find index for file ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/TOPMED_GRCh37.vcf.gz at /soft/Bio-DB-HTS-3.01/lib/perl/5.18.2/Bio/DB/HTS/Tabix.pm line 53, <__ANONIO__> line 560.

According to this line - exactly af_from_vcf=1 option is reason for plugin to use Tabix,

VEP 97 release notes says that G2P can be run offline and I see some code about it: 1 and 2. But I dont fully understand why Tabix looking for index for files in Ensembl FTP (and xs-code for this is too hard to read without knowledge).

Btw - looks like Emseble FTP stores TB's of data in directory where Plugin pointing - ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/. How can plugin work offline then?

Can you help please?

Stikus commented 5 years ago

Run VEP again after switching to 98.2 release (nothing changed in code looks like, except of cache url fix) - again 2 warnings, but this time about same file:

WARNING: 328 : Couldn't find index for file ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/TOPMED_GRCh37.vcf.gz at /soft/Bio-DB-HTS-3.01/lib/perl/5.18.2/Bio/DB/HTS/Tabix.pm line 53, <__ANONIO__> line 560.
WARNING: 323 : Couldn't find index for file ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/TOPMED_GRCh37.vcf.gz at /soft/Bio-DB-HTS-3.01/lib/perl/5.18.2/Bio/DB/HTS/Tabix.pm line 53, <__ANONIO__> line 560.

Very strange for me.

at7 commented 5 years ago

Hello, The plugin looks up allele frequencies from the cache files. We store exome gnomAD, ESP and 1000 Genomes frequency data in our cache files. The plugin can also look up frequencies from VCF files. Look up from VCF files is switched on by setting af_from_vcf=1. Until recently we needed a database connection to set up the ensembl-variation API to allow look up from VCF files. This requirement has now been removed and it is possible to look up frequencies in offline mode. I'm not sure why you are getting the warning message and I will need to do some further investigation into the cause of it. You could run without af_from_vcf=1 in the meantime. How did you install VEP? Did you run the INSTALL.pl script and did you install htslib in the installation process? Thank you.

Stikus commented 5 years ago

I'll try to run without af_from_vcf until problem will resolved, thanks. About our installation - there is my issue with installation commands. Our VEP installation command: perl INSTALL.pl --NO_TEST --NO_UPDATE --NO_HTSLIB --NO_BIOPERL -a ap --PLUGINSDIR "$SOFT/ensembl-vep-${VEP_VERSION}/Plugins" --PLUGINS ProteinSeqs,Downstream,Conservation,GO,G2P

Our VEP cache install command: $VEPINSTALL --NO_TEST --NO_UPDATE --NO_HTSLIB --NO_BIOPERL -a cf -s homo_sapiens -y $vepCacheAssembly --CONVERT -c $VEPCACHEDIR

We installing HTSLIB separately and before VEP:

# samtools 1.9 & htslib 1.9
RUN cd "$SOFT" \
    && wget -q "https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2" -O "$SOFT/samtools-1.9.tar.bz2" \
    && tar -xjf "$SOFT/samtools-1.9.tar.bz2" \
    && mv "$SOFT/samtools-1.9" "$SOFT/samtools-1.9-src" \
    && cd "$SOFT/samtools-1.9-src/htslib-1.9" \
    && ./configure --prefix="$SOFT/htslib-1.9" --enable-libcurl --enable-plugins --with-libdeflate CFLAGS="-fPIC -O3" CPPFLAGS="-I$LIBDEFLATE_ROOT/include" LDFLAGS="-L$LIBDEFLATE_ROOT/lib" \
    && make -j"$(($(nproc)+1))" \
    && make install \
    && cd "$SOFT/samtools-1.9-src" \
    && ./configure --prefix="$SOFT/samtools-1.9" --with-htslib="$SOFT/htslib-1.9" CFLAGS="-g -O3 -fPIC" \
    && make -j"$(($(nproc)+1))" \
    && make install \
    && cd "$SOFT" \
    && rm -r "$SOFT/samtools-1.9-src" \
    && rm "$SOFT/samtools-1.9.tar.bz2"
ENV SAMTOOLS="$SOFT/samtools-1.9/bin/samtools" \
    BGZIP="$SOFT/htslib-1.9/bin/bgzip" \
    TABIX="$SOFT/htslib-1.9/bin/tabix" \
    PATH="$SOFT/samtools-1.9/bin:$SOFT/htslib-1.9/bin:$PATH" \
    LD_LIBRARY_PATH="$SOFT/htslib-1.9/lib:$LD_LIBRARY_PATH" \
    HTSLIB_DIR="$SOFT/htslib-1.9"

If you need any information - I'll try to provide it.

at7 commented 5 years ago

Your installation looks absolutely fine. However, I'm still not able to reproduce the error. I'm inquiring if we had any problems with our FTP site today. If you are still having the same warning messages and want to use additional frequencies from VCF for your annotations I recommend that you download the files locally. You need to update a config file so the plugin will know where to locate the VCF files. The config file should be under PATH_TO/ensembl-vep/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json. You need to update a few things in the config file: Search for topmed_GRCh37. And then update the highlighted sections: { "id": "topmed_GRCh37", "species": "homo_sapiens", "assembly": "GRCh37", "type": "local", "filename_template": "YOUR_FILE_LOCATION/TOPMED_GRCh37.vcf.gz", "chromosomes": [ "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y" ], "source_name": "TOPMed", "population_display_group": { "display_group_name": "TOPMed", "display_group_priority": 2.5 }, "populations": { "9990000": { "name": "TOPMed", "_raw_name": "TOPMed", "_af": "TOPMED", "description": "Trans-Omics for Precision Medicine (TOPMed) Program" } } },

You also need to update: uk10k_GRCh37 and gnomADg_GRCh37. Having the files locally also gives you a better overall performance in terms of runtime. Please let me know if you have any questions related to setting up the files locally.

Stikus commented 5 years ago

Thanks for detailed answer as always :) Few more questions:

We are using both GRCh37 and GRCh38 gemomes - do I need to change config sections for:
- gnomADg_GRCh37
- gnomADg_GRCh38
- topmed_GRCh37
- topmed_GRCh38
- uk10k_GRCh37
- uk10k_GRCh38 What about:
- 1000genomes_phase3_GRCh37
- 1000genomes_phase3_GRCh38
- gnomADe_GRCh37
- gnomADe_GRCh38
- CADD_GRCh38_whole_genome_SNVs
We need to store needed files locally and switch type: remote to type: local. Is it enough for plugin to correctly use local database with --offline option for VEP itself?
We using docker and config will be stored in root directory. But sometimes we run docker with user that dont have rights for change files in root dir (with CWL especially). Reference directory with VEP cache can be passed to VEP at runtime with --dir_cache option with no problem, but hardcoding path in config wont work for us. Is there any option to pass directory with files to plugin on to pass temporary created config file? Or maybe option to use environmental variables in config?

at7 commented 5 years ago

If you set af_from_vcf=1 the plugin will retrieve additional frequencies for gnomADe, gnomADg, topmed and uk10k from VCF files. You can specify which data sets you want use with af_from_vcf_keys for example af_from_vcf_keys=topmed&gnomADg. This is documented in the plugin header. We probably will remove gnomADe from the list of supported data sets because the frequencies are stored in the cache files. We had some problems initially storing some of the gnomADe frequencies in our cache files. This has since been corrected. To safe some time just specifiy af_from_vcf_keys=topmed&gnomADg&uk10k. You only need to update the vcf_config file for data sets which you want to and can use in your analysis.

If you download any of the datasets you need to switch remote to local for those datasets.

I recommend to use --offline. In that case you also have to provide a FASTA file with --fasta. The reason for that is that the G2P plugin returns a report which contains HGVS notation for your input variants. Generation of HGVS notation either needs a connection to the database for sequence lookup or a FASTA file. I will update the documentation accordingly.

You can set the environment variable ENSEMBL_VARIATION_VCF_CONFIG_FILE to point to your config file and the API will check first if the variable is set before falling back to the default location PATH_TO/ensembl-vep/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json.

Stikus commented 5 years ago

Existing of ENSEMBL_VARIATION_VCF_CONFIG_FILE is great news - it will help us, thank you.

About --fasta - according to this page we dont need to specify --fasta if we have default path and --cache - or I missunderstood something?

at7 commented 5 years ago

You are absolutely right. You don't need to specify the fasta file location if you have run the installation script with the fasta option.

Stikus commented 5 years ago

Thanks for clarification - I'll try to run plugin with options discussed and report there if I have any troubles. Maybe today or tomorrow.

Stikus commented 5 years ago

@at7 Good evening. I'm trying to implement local database for plugin but looks like this folder contains 350+ GB of data - little bit too much for our network transfer :) Do we need this full directory, or this file is enough?

at7 commented 5 years ago

This really depends on your use case. G2P filters by gene and would therefore exclude most of the variants from the gnomAD genomes VCF files. The plugin would always move your input variants into the result set after passing all the other filtering criteria and if there isn't any frequency available for filtering. Therefore by not using the gnomAD genome VCF files you could end up with more variants in your result set which you could check separately against gnomAD. The full directory reflects the latest gnomAD version 2.1.

Ensembl / VEP_plugins

G2P plugin usage with multiple input csv #249