EGA-archive / beacon2-ri-tools-v2

Apache License 2.0

Beacon 2 RI tools v2.0

This repository contains the new Beacon RI tools v2.0, a software package whose main goal is to generate BFF data from .csv or .vcf files (and possibly more data file types in the future). It is based on the first Beacon RI tools, a previous and different version that you can find here: Beacon RI tools v1. The new features of Beacon RI tools v2.0 are:

Data conversion process

The main goal of Beacon RI tools v2.0 is to obtain a BFF file (JSON following the official Beacon v2 specifications) that can be injected into a Beacon v2 MongoDB database. To obtain a Beacon v2 with its MongoDB and see how to inject these BFF files, check out the official Beacon v2 RI API repository, which you can download for free. To get this JSON file, you can convert your data either from a .vcf file or from a .csv file. Please see the instruction manual below to follow the right steps for the data conversion. In the end, you will have completed one of the possible conversion processes shown in the following diagram: Beacon tools v2 diagram

Installation guide with Docker

First of all, clone or download the repository to your computer:

git clone https://github.com/EGA-archive/beacon2-ri-tools-v2.git

To spin up the Beacon RI tools v2 container, run the following command from the root folder:

docker-compose up -d --build

Once the container is up and running, you can start using Beacon RI tools v2. Congratulations!

Instruction manual

Setting the configuration and CSV file

To start using Beacon RI tools v2, edit the configuration file conf.py, which you will find inside the conf folder. Inside this file you will find the following parameters:

#### Input and Output files config parameters ####
csv_filename='csv/examples/cohorts.csv'
output_docs_folder='output_docs/CINECA_dataset/'

#### VCF Conversion config parameters ####
num_variants=100000
reference_genome='GRCh37' # Choose one between NCBI36, GRCh37, GRCh38

Generic config parameters

The csv_filename variable sets the path of the .csv file the script reads data from and writes data to. This .csv file must use the headers exactly as they appear in the files inside the templates folder; any header whose name differs from the ones in those files will not be read by Beacon RI tools v2. The output_docs_folder variable sets the folder where your final .json files will be saved once the execution of Beacon tools finishes. This folder must always be inside output_docs, so only the subdirectory inside output_docs can be modified in this path.
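For example, pointing the tool at your own CSV could look like the hedged sketch below; the file name and output subfolder are placeholders, not files shipped with the repository.

#### Input and Output files config parameters ####
csv_filename='csv/examples/my_cohorts.csv'      # placeholder: path to your own CSV (headers must match the templates)
output_docs_folder='output_docs/my_dataset/'    # only the part after output_docs/ may change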

VCF conversion config parameters

The num_variants variable must be set when you run the VCF converter (genomicVariations_vcf.py). It tells the script how many VCF lines will be read and converted from the file(s). The reference_genome is the genome reference the tool uses to map the positions of the chromosomes. The allele_frequency lets you set a threshold for the allele frequency of the variants you want to convert from the VCF file.
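Putting the three VCF parameters together, a conf.py fragment could look like the sketch below; the values are illustrative, and the allele_frequency line assumes the parameter is spelled exactly like that in conf.py.

#### VCF Conversion config parameters ####
num_variants=100000          # number of VCF lines to read and convert
reference_genome='GRCh37'    # choose one between NCBI36, GRCh37, GRCh38
allele_frequency=0.05        # illustrative threshold on the variants' allele frequency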

Converting data from .vcf.gz file

To convert data from .vcf.gz to .json, copy all the files you want to convert into the files_to_read folder. You will need to provide at least one .vcf.gz file and save it in this folder. Then run:

docker exec -it ri-tools python genomicVariations_vcf.py

After that, if needed, export your documents from MongoDB to a .json file using one of two possible commands. The first one deletes the "_id" entries generated by MongoDB:

docker exec ri-tools-mongo mongoexport --jsonArray --uri "mongodb://root:example@127.0.0.1:27017/beacon?authSource=admin" --collection genomicVariations | sed '/"_id":/s/"_id":[^,]*,//g' > genomicVariations.json

The second one keeps the "_id" entries generated by MongoDB:

docker exec ri-tools-mongo mongoexport --jsonArray --uri "mongodb://root:example@127.0.0.1:27017/beacon?authSource=admin" --collection genomicVariations > genomicVariations.json

This will generate the final .json file in Beacon Friendly Format. Bear in mind that, this time, the file will be saved in the directory you are currently in, so if you want to save it in the output_docs folder, include that path in the output redirection of the mongoexport command.
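If you prefer to strip the "_id" entries after exporting instead of piping through sed, a small post-processing script like the hedged sketch below would do the same job; the file names are placeholders.

import json

# Illustrative post-processing (an alternative to the sed filter above):
# remove the MongoDB-generated "_id" entries from an exported JSON array.
with open('genomicVariations.json') as fh:
    docs = json.load(fh)

for doc in docs:
    doc.pop('_id', None)  # drop the ObjectId reference if present

with open('output_docs/genomicVariations.json', 'w') as fh:
    json.dump(docs, fh, indent=2)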

Creating the .csv file (for metadata, or when you do not have a VCF file for genomicVariations)

If you want to convert metadata into BFF, or fill in a genomicVariations CSV to convert it to JSON, you will have to create a .csv file and write the records according to the header columns, which indicate the field of the schema the data will be placed in. Every new row will be appended to the final output file as a new, independent document. Fill in the CSV file following the next rules:
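Independently of those rules, a purely illustrative way to start such a file is to reuse a template's header row and append one row per document, as in the sketch below; the template name, output path, and values are placeholders, and the real header names are the ones in the templates folder.

import csv

# Purely illustrative: copy the header row from a template CSV and append
# one row per final JSON document.
template_path = 'templates/individuals.csv'       # assumed template location
output_path = 'csv/examples/my_individuals.csv'   # placeholder output path

with open(template_path, newline='') as fh:
    header = next(csv.reader(fh))

with open(output_path, 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(header)
    # one row per document, in the same column order as the header
    writer.writerow(['IND0001'] + [''] * (len(header) - 1))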

Getting .json final documents

Before generating the final .json documents, make sure the conf.py file inside the conf folder is pointing to the right .csv document, then execute the corresponding script from the root folder in your terminal (for the collection you have chosen; in this case, genomicVariations):

docker exec -it ri-tools python genomicVariations_csv.py

All the possible scripts you can execute (individually) to convert csv data for each collection are:

docker exec -it ri-tools python analyses_csv.py
docker exec -it ri-tools python biosamples_csv.py
docker exec -it ri-tools python cohorts_csv.py
docker exec -it ri-tools python datasets_csv.py
docker exec -it ri-tools python genomicVariations_csv.py
docker exec -it ri-tools python individuals_csv.py
docker exec -it ri-tools python runs_csv.py

Once you execute one of the scripts listed above, it will generate the final .json file in Beacon Friendly Format in the output_docs folder, named after the collection with a .json extension, e.g. genomicVariations.json.

This file can then be used in a MongoDB for Beacon. To learn how to import it into a Beacon v2, please follow the instructions in the Beacon v2 RI API.
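The authoritative import procedure is the one described in the Beacon v2 RI API repository; as a rough illustration only, loading a BFF file into a MongoDB collection with pymongo could look like the sketch below. The URI reuses the credentials shown in the mongoexport commands above, while the database and collection names and the file path are assumptions.

import json
from pymongo import MongoClient

# Rough illustration only: insert a BFF JSON array into a MongoDB collection.
client = MongoClient('mongodb://root:example@127.0.0.1:27017/?authSource=admin')
collection = client['beacon']['genomicVariations']

with open('output_docs/genomicVariations.json') as fh:
    documents = json.load(fh)

collection.insert_many(documents)
print(f'Inserted {len(documents)} documents into genomicVariations')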

Version notes

Acknowledgements

Thanks to all the EGA Archive team, and especially: