almeidasilvaf / doubletrouble

An R package to identify and classify duplicated genes from whole-genome protein sequence data
https://almeidasilvaf.github.io/doubletrouble/

add the Phytozome API #6

Closed WWz33 closed 1 month ago

WWz33 commented 2 months ago

Hi, Fabricio. I have learned a lot about comparative genomics by studying your packages and articles, thanks. Would you like to add the Phytozome API?

Best!

almeidasilvaf commented 2 months ago

Hi, @WWz33

Thanks a lot for your feedback! :)

What do you mean by 'add the Phytozome API'?

WWz33 commented 2 months ago

Hi, Fabricio. With biomart it is possible to download genomic data from Ensembl and NCBI, but not from Phytozome. Have you considered downloading data directly from these sites and converting it to 'pdata'? This would simplify the frustrating data preparation process. Also, I don't quite understand the logic in classify_gene_pairs(,scheme="extended"). If I have 5 target species and one outgroup, how do I build blast_inter? I would appreciate it if you could answer this for me!

Best!

almeidasilvaf commented 2 months ago

Hi, @WWz33

That's a very good idea. I thought about it when writing the vignette for doubletrouble, but at that time {biomartr} didn't have the option to download data from Ensembl Genomes instances. Now this functionality seems to be stable, so I will see if this is possible. I have to check how long it takes to download the data in the vignette, because there's a time limit for vignettes to run.

Regarding the blast_inter parameter of classify_gene_pairs(), you would do exactly as documented here, but your data frame of comparisons (named comparisons in the vignette) would have multiple rows, each indicating a query species and its outgroup species. For example, if you have species spA, spB, spC, and spD, which share the same outgroup spX, you would build your data frame with:

comparisons <- data.frame(
    species = c("spA", "spB", "spC", "spD"),
    outgroup = "spX"
)

Then, you can run the interspecies DIAMOND searches with:

diamond_inter <- run_diamond(
    seq = pdata$seq,
    compare = comparisons,
    outdir = file.path(tempdir(), "diamond_inter"),
    ... = "--sensitive"
)

You just have to make sure that species names in the comparisons data frame match species names in pdata.
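As a minimal sketch of that check for the case asked about above (five target species and one outgroup), something like the following could be run before the DIAMOND searches. It assumes pdata$seq is a named list with one element per species. The pdata object below is only a hypothetical stand-in, and the species names are illustrative.

```r
# Hypothetical stand-in for pdata$seq: a named list with one element per species.
# In real use, the elements would hold the processed protein sequences.
pdata <- list(
    seq = list(
        spA = character(), spB = character(), spC = character(),
        spD = character(), spE = character(), spX = character()
    )
)

# Five target species sharing the same outgroup spX (illustrative names)
comparisons <- data.frame(
    species = c("spA", "spB", "spC", "spD", "spE"),
    outgroup = "spX",
    stringsAsFactors = FALSE
)

# Species named in `comparisons` that are absent from pdata$seq
missing_species <- setdiff(
    unique(c(comparisons$species, comparisons$outgroup)),
    names(pdata$seq)
)

# Fail early, before any searches are launched, if a name does not match
if (length(missing_species) > 0) {
    stop(
        "Species not found in pdata$seq: ",
        paste(missing_species, collapse = ", ")
    )
}
```

If the check passes silently, every name in comparisons has a matching entry in pdata$seq.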

Does that answer your question?

Best, Fabricio

WWz33 commented 2 months ago

Hi, Fabricio

Thank you very much for your patient explanation. I think I've understood.

Best regards!

almeidasilvaf commented 1 month ago

Hi, @WWz33

Based on your feedback, I just pushed a new version of {doubletrouble} with an expanded vignette that includes more info on the input data and how to obtain data using {biomartr} (see here).

I'll close this issue now.

All the best, Fabricio