geraldinepascal / FROGS

FROGS is a Galaxy/CLI workflow designed to produce an OTU count matrix from high-depth amplicon sequencing data.
GNU General Public License v3.0

Database format? #29

Closed Toliman06 closed 6 years ago

Toliman06 commented 6 years ago

Hello everybody,

I saw that some databank files grouping reference sequences can be downloaded (http://genoweb.toulouse.inra.fr/frogs_databanks/assignation), but I need to build my own custom databank.

I tried to find some information on this but could not. Could you help me and tell me the right way to format my reference data?

mariabernard commented 6 years ago

Hi (again),

I have a Python script to do that, but it is a work in progress and not shared yet (and, I must warn you, it is not the top priority).

What you need, at a minimum, is a FASTA file with the taxonomy provided as the sequence description. All taxonomies must contain the same number of ranks (typically 7, down to species). The database needs to be formatted at least for BLAST+ blastn (see the makeblastdb command) and, for the complete setup, for the RDP classifier as well (see the train program of the RDP classifier.jar).

The most difficult part is formatting the taxonomy for RDP.
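The "same number of ranks" requirement can be checked before formatting anything. Below is a minimal sketch, not part of FROGS: it assumes headers of the form `>seqID taxonomy`, with the taxonomy as a single `;`-separated string after the first whitespace. The function name `rank_counts` and the example records are hypothetical.

```python
from collections import Counter


def rank_counts(fasta_lines, sep=";"):
    """Count how many FASTA headers carry each number of taxonomic ranks."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        # Assumed layout: ">seqID taxonomy", split on the first whitespace only,
        # so taxon names containing spaces stay intact.
        parts = line[1:].strip().split(None, 1)
        taxonomy = parts[1] if len(parts) > 1 else ""
        ranks = [r for r in taxonomy.split(sep) if r.strip()]
        counts[len(ranks)] += 1
    return counts


# Hypothetical records: the second one has only 3 ranks instead of 7.
example = [
    ">seq1 Bacteria;Actinobacteria;Actinobacteria;unclassified;unclassified;unclassified;unclassified",
    "ACGT",
    ">seq2 Bacteria;Firmicutes;Bacilli",
    "ACGT",
]

counts = rank_counts(example)
print(dict(counts))
if len(counts) > 1:
    print("Inconsistent rank depth: every taxonomy should have the same number of ranks.")
```

For a real file, the same function can be fed an open file handle, for example `rank_counts(open("db.fasta"))`; a database that is ready for formatting should yield a single key.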

I hope it helps.

Toliman06 commented 6 years ago

An alpha version of the Python script would be very helpful for me, even if it's not finished... ;)

mariabernard commented 6 years ago

Good morning!

So I created a "dirty_toad" ( ;-) ) branch on the FROGS GitHub repository, where you will find a fasta2RDP.py Python script in the libexec directory.

This version converts FASTA and tax files into RDP-compatible FASTA and tax files. It does not check anything else (number of ranks, uniqueness of taxon names, format, ...).
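Since the script does not validate its input, a few checks can be run on the tax file beforehand. The sketch below is only one possible interpretation and is not part of FROGS: it assumes a two-column, tab-separated file (sequence ID, then a `;`-separated taxonomy), reports differing rank counts, and flags a taxon name that appears at the same rank under two different parent lineages. The function name `check_taxonomy`, the `placeholder` argument, and the example lines are hypothetical.

```python
def check_taxonomy(tax_lines, sep=";", placeholder="unclassified"):
    """Return a list of human-readable problems found in db.tax-style lines."""
    problems = []
    depths = set()
    seen_ids = set()
    first_parent = {}  # (rank index, taxon name) -> parent lineage first seen
    for n, line in enumerate(tax_lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(f"line {n}: expected 2 tab-separated fields")
            continue
        seq_id, taxonomy = fields
        if seq_id in seen_ids:
            problems.append(f"line {n}: duplicate sequence ID {seq_id}")
        seen_ids.add(seq_id)
        ranks = taxonomy.split(sep)
        depths.add(len(ranks))
        for i, name in enumerate(ranks):
            if name == placeholder:
                continue  # placeholders legitimately repeat under many parents
            lineage = sep.join(ranks[:i])
            known = first_parent.setdefault((i, name), lineage)
            if known != lineage:
                problems.append(
                    f"line {n}: '{name}' at rank {i + 1} has parents "
                    f"'{known}' and '{lineage}'"
                )
    if len(depths) > 1:
        problems.append(f"taxonomies have different rank counts: {sorted(depths)}")
    return problems


# Hypothetical input: 'Bacilli' sits under two phyla, and seq3 is too short.
example = [
    "seq1\tBacteria;Firmicutes;Bacilli",
    "seq2\tBacteria;Proteobacteria;Bacilli",
    "seq3\tBacteria;Firmicutes",
]
problems = check_taxonomy(example)
for problem in problems:
    print(problem)
```

On a real file this could be called as `check_taxonomy(open("db.tax"))`; an empty list means none of these particular problems were found, not that the file is guaranteed to train correctly.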

Here is the procedure to create a database.

### input files ###
head db.fasta
# >seqID1 
#  ACGT...

head db.tax
# seqID1    Bacteria;Actinobacteria;Actinobacteria;unclassified;unclassified;unclassified;unclassified
# This is a tab-separated file. All taxonomies need to be defined with exactly the same number of ranks, separated by ";".

### conversion for RDP ###
python fasta2RDP.py -d db.fasta -t db.tax -r R1 R2 R3 R4 R5 R6 R7 --rdp-taxonomy out_dir/final_db.tax --rdp-fasta out_dir/final_db.fasta 
# -r defines the rank names. You need to provide as many names as there are taxonomic levels.

### train RDP ###
java -Xmx60g -jar RDPTools_DIR/classifier.jar train -o out_dir -s out_dir/final_db.fasta -t out_dir/final_db.tax
# This step is memory-consuming!

### copy properties file ###
# Look in one of the preformatted databases we provide, copy its properties file, and rename it.
cp silva_128_16S.fasta.properties out_dir/final_db.fasta.properties

### format for blast ###
makeblastdb -in out_dir/final_db.fasta -dbtype nucl

### test your database ###
# See the affiliation_otu.py command line in FROGS_DIR/test/test.sh

### If you use Galaxy, do not forget to add the FASTA file path to the .loc file; see the FROGS README ###

I hope everything works.

If your database is published, or if you agree to share it, we would also be glad to offer it to all FROGS users.

Toliman06 commented 6 years ago

Hello (sorry for the delay). The procedure worked perfectly for building my database, even though I have not tested the result yet. I will keep you posted and see if I can create other databases...

leoneago commented 2 years ago

Dear @mariabernard, is fasta2RDP.py still up to date? I would like to use it to create the RDP files for my reference database. Also, when I try to run it, it seems to rely on a module named "frogsNode" that appears to be unavailable at present. Any help would be greatly appreciated. Best,

mariabernard commented 2 years ago

Hello,

frogsNode is one of the Python libraries developed in FROGS. Before using fasta2RDP.py, you need to add this library to your PYTHONPATH variable.

export PYTHONPATH=[FROGS_PATH]/lib:$PYTHONPATH
python3 [FROGS_PATH]/libexec/fasta2RDP.py  ....
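To confirm that the library directory is picked up before running the script, a quick check from Python may help. This is a minimal sketch using only the standard library; the helper name `module_available` and the path `/path/to/FROGS/lib` are placeholders standing in for [FROGS_PATH]/lib, not real FROGS values.

```python
import importlib.util
import sys


def module_available(name, extra_path=None):
    """Report whether `name` can be imported, optionally adding one directory first."""
    if extra_path and extra_path not in sys.path:
        # Same effect as prepending the directory to PYTHONPATH for this process.
        sys.path.insert(0, extra_path)
    return importlib.util.find_spec(name) is not None


# Placeholder path: replace with the lib directory of your FROGS checkout.
frogs_lib = "/path/to/FROGS/lib"

if module_available("frogsNode", frogs_lib):
    print("frogsNode found")
else:
    print("frogsNode not found; check the path given to PYTHONPATH")
```

If the module is reported as missing, the usual cause is that the exported path does not point at the directory that actually contains frogsNode.py.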

regards

Maria