AlvaroRodriguezDelRio / nov-fams-pipeline


Wrapper script for pipeline #4

Open chrissy005 opened 1 month ago

chrissy005 commented 1 month ago

Hello,

I was reading your paper and am very interested in applying this to my dataset. Do you happen to have a wrapper script for the entire pipeline, or a list of the order in which the scripts must be executed?

AlvaroRodriguezDelRio commented 1 month ago

Hi!

We do not have a wrapper at the moment, but will consider working on one in the future. Instructions on how to run the scripts individually are provided in the README.md file. Please let me know if you have any questions.

Cheers,

Alvar

chrissy005 commented 1 month ago

Hello Alvar,

Thank you for pointing me towards the documentation. However, I am still lost on the exact order of commands and scripts that must be used.

I also see some python scripts on the left-hand side of the screen and I am not quite sure how to use them.

AlvaroRodriguezDelRio commented 3 weeks ago

Hi,

Could you please indicate which steps are confusing? I will try to improve the documentation.

Thanks!

Alvaro

chrissy005 commented 2 weeks ago

Hello Alvaro,

Thank you for following up.

1) I am already stuck at the very first step as my gene names are not in the suggested format. My fasta headers look like this " > MAG001_MAG001_1_5 rank: D; Eukaryotic translation initiation factor eIF2A [PF08662.14]; Dipeptidyl peptidase IV (DPP IV) N-terminal region [PF00930.24]; Protein of unknown function (DUF1513) [PF07433.14]; IKI3 family [PF04762.15] (db=pfam)".

Do you think I can run the very first step without having to change the headers?

2) Assuming I manage to run the first step, 'Deep homology-based protein clustering', using MMseqs2 with the suggested flags "--min-seq-id 0.3 -c 0.5 --cov-mode 1 --cluster-mode 2 -e 0.001", is the output from this step (multifasta.faa) then the input for mapping the gene families against each of the 4 reference databases, in order to isolate those exclusive to uncultivated taxa?
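For reference, the flags quoted above could be assembled into a full MMseqs2 call roughly as follows. This is only a sketch: the use of the `easy-cluster` workflow, the output prefix `clusters`, and the temporary directory `tmp` are assumptions for illustration, not something taken from the pipeline's README, which may use a different MMseqs2 workflow.

```python
import shlex


def build_cluster_command(fasta, out_prefix, tmp_dir):
    """Build an argv list for an MMseqs2 clustering run.

    The flag values are the ones quoted in this thread. The choice of the
    `easy-cluster` workflow is an assumption, not confirmed by the pipeline.
    """
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", "0.3",   # minimum sequence identity
        "-c", "0.5",             # minimum alignment coverage
        "--cov-mode", "1",       # coverage mode, as quoted in the thread
        "--cluster-mode", "2",   # clustering mode, as quoted in the thread
        "-e", "0.001",           # E-value threshold
    ]


# "clusters" and "tmp" are hypothetical names chosen for this example.
cmd = build_cluster_command("multifasta.faa", "clusters", "tmp")
print(shlex.join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where `mmseqs` is installed.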

I have similar questions for all the steps. As I am not familiar with, and have not tried, any of the steps outlined in your pipeline, I am unable to figure out what the input and output files should be for each step. I apologize if these questions are too basic. If the documentation included example input and output files for each step, in sequential order, I would be able to run the pipeline on my dataset. This is an excellent method for identifying novel genes and I am very keen to apply it.

AlvaroRodriguezDelRio commented 2 weeks ago

Hi!

The scripts assume that the fasta headers are formatted as indicated; if they are not, the scripts will not work. You can also adapt the scripts to your specific fasta format, but this may be more complicated. To answer your second question, the initial multifasta file (multifasta.faa) is the one you need to map against the reference databases. I will clarify this in the documentation.
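An alternative to adapting the scripts is to rewrite the fasta headers before running anything. Below is a minimal sketch that keeps only the first whitespace-delimited token of each header (the gene ID) and records the original header so the annotations are not lost. The required target format is the one given in the README and is not reproduced here; `rewrite_headers` and its `make_name` hook are hypothetical names for this example, and `make_name` is where the gene ID would be turned into that format.

```python
def rewrite_headers(lines, make_name=lambda gene_id: gene_id):
    """Return (new_lines, mapping) for an iterable of FASTA lines.

    Each header is reduced to make_name(first token of the header).
    `mapping` links every new name back to the original header text.
    """
    new_lines, mapping = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            original = line[1:].strip()        # tolerate "> ID ..." spacing
            tokens = original.split()
            gene_id = tokens[0] if tokens else ""
            new_name = make_name(gene_id)      # hypothetical formatting hook
            mapping[new_name] = original
            new_lines.append(">" + new_name)
        else:
            new_lines.append(line)             # sequence lines are unchanged
    return new_lines, mapping


# Header shortened from the one quoted in this thread; the sequence is made up.
example = [
    "> MAG001_MAG001_1_5 rank: D; IKI3 family [PF04762.15] (db=pfam)",
    "MSTNPKPQRK",
]
new_lines, mapping = rewrite_headers(example)
print(new_lines[0])  # → >MAG001_MAG001_1_5
```

Keeping the mapping makes it possible to restore the original Pfam annotations on the results afterwards.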

In case you wanna avoid running the whole pipeline, you can also map your sequences against the collection of novel gene families that we presented in the paper using eggnog-mapper (http://eggnog-mapper.embl.de/).

Cheers,

Alvaro

chrissy005 commented 2 weeks ago

Hello Alvaro,

Thank you for explaining. So, based on what you're suggesting, since my fasta headers are not in the required format, the best approach would be to skip the full pipeline and instead map the sequences against the collection of novel gene families with eggnog-mapper? Am I understanding this correctly?

Also, would it be possible to contact you directly by email? If so, could you kindly share your email address?