almeidasilvaf / syntenet

An R package to infer and analyze synteny networks from protein sequences

https://almeidasilvaf.github.io/syntenet/

Import data #8

Closed alexvasilikop closed 2 years ago

alexvasilikop commented 2 years ago

Hello,

Thanks for developing such useful tools. I could not find anywhere the proper function to import the data (proteomes and annotations) into syntenet, or the correct format.

I have modified the data as described here: https://github.com/zhaotao1987/SynNet-Pipeline/wiki/Genome-Preparation. Is this the correct format?

Which function should I use? The documentation describes an example that uses already existing data, so no parsing is shown.

Many thanks, Alex

almeidasilvaf commented 2 years ago

Hi, @alexvasilikop

Thank you very much for the feedback.

There is a note in the vignette pointing to the functions Biostrings::readAAStringSet() (to read FASTA files) and rtracklayer::import() (to read GFF/GTF files).
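As a minimal sketch of the two functions named above, the following reads one FASTA file and one GFF3 file into the R session. The file contents, gene IDs, and the species name `speciesA` are made up for illustration, and the sketch assumes the Bioconductor packages Biostrings and rtracklayer are installed.

```r
# Assumes Biostrings and rtracklayer (Bioconductor) are installed.
library(Biostrings)
library(rtracklayer)

# Write tiny, made-up example files so the sketch is self-contained.
fasta_file <- tempfile(fileext = ".fa")
writeLines(c(">gene1", "MKTAYIAKQR", ">gene2", "MSDNELLKQA"), fasta_file)

gff_file <- tempfile(fileext = ".gff3")
writeLines(c(
    "##gff-version 3",
    "chr1\texample\tgene\t100\t400\t.\t+\t.\tID=gene1",
    "chr1\texample\tgene\t900\t1500\t.\t-\t.\tID=gene2"
), gff_file)

# Read the protein sequences as an AAStringSet object.
proteins <- readAAStringSet(fasta_file)

# Read the gene coordinates as a GRanges object.
annotation <- import(gff_file)

# Group by species: one list element per species (hypothetical name).
seq_list <- list(speciesA = proteins)
annot_list <- GenomicRanges::GRangesList(speciesA = annotation)
```

With real data, `fasta_file` and `gff_file` would point to your own files, and each additional species would add one more named element to both lists.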

However, if you did not notice it, it probably should be more than a simple note.

I will write a whole section explaining how to import FASTA and GFF/GTF files into the R session. Give me a few minutes.

Best, Fabricio

almeidasilvaf commented 2 years ago

Hi, @alexvasilikop

I wrote a section on how to load FASTA files as a list of AAStringSet objects and GFF/GTF files as a GRangesList.

Check out the new documentation website: https://almeidasilvaf.github.io/syntenet/

Best, Fabricio

jhcuarta commented 1 year ago

Hi, sorry to bother you on a closed issue. I'm really eager to run a microsynteny-based phylogeny, and I was wondering if you could guide me on how to import the FASTA and GFF files for the analysis. I've been having issues with the .pep and .bed formats and have had no success. I see there's a way using Biostrings::readAAStringSet() (to read FASTA files) and rtracklayer::import(), but I'm a little lost, as I do not have much expertise.

almeidasilvaf commented 1 year ago

Hi, @jhcuarta

After this issue was opened, I updated the documentation with an entire section on how to load data from FASTA and GFF files (see here).

Besides, note that:

.pep is not a file format. Some people like to add .pep to the filename to indicate that it contains peptide/protein sequences, but it is still a FASTA file. The correct extension is .fa or .fasta (e.g., speciesA.fa, or even speciesA.pep.fa).
Gene coordinates should be stored as GFF/GFF3 files, not as BED files. Please read the section on data import in the documentation.
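To illustrate the first point, here is a small sketch showing that a file named with .pep is read like any other FASTA file, because `Biostrings::readAAStringSet()` defaults to `format = "fasta"` regardless of the extension. The file and its single sequence are made up, and the copy to a `.pep.fa` name is only one possible way to follow the naming advice.

```r
# Assumes Biostrings (Bioconductor) is installed.
library(Biostrings)

# A made-up file whose name ends in .pep but whose content is plain FASTA.
pep_file <- tempfile(fileext = ".pep")
writeLines(c(">gene1", "MKTAYIAKQR"), pep_file)

# The default format is "fasta"; the file extension is not used to pick it.
proteins <- readAAStringSet(pep_file)

# Optionally copy the file to a conventional FASTA name.
fa_file <- sub("\\.pep$", ".pep.fa", pep_file)
file.copy(pep_file, fa_file)
```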

Best, Fabricio

jhcuarta commented 1 year ago

Hi, I was wondering if you could help me out, since my data didn't pass check_input. I'm confused, because both files were obtained using Prokka 1.14.6. The names for the proteins and the headers must match, right? Here are two links to my data files so you can give me a hand:

https://drive.google.com/file/d/1RT_IKKFsnGGTS0E_SBkdPeWu1E3GVtVq/view?usp=sharing https://drive.google.com/file/d/11zkv1m2fEZA7BjQalRlWnAqdWEyaeC2-/view?usp=sharing

Best regards, and thanks in advance.

jhcuarta commented 1 year ago

Hi, I re-edited the FASTA sequence headers so the names would match, but the data still didn't pass the check_input filter. Could you please take a look at the files so I know how to edit them so that they match? I'm bewildered, since both files were produced by the same application (Prokka 1.14.6), and the naming should be preserved across its output files. Could you please help me out and check the files? I'm really eager to use your package. Thanks in advance.
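A name mismatch like the one described above can be narrowed down before sharing any files by comparing the FASTA headers with the gene IDs in the annotation. The sketch below is one way to do that on made-up data. It assumes the gene IDs live in the `ID` attribute of the GFF file and that only the text before the first whitespace in each FASTA header is the ID. Both are assumptions that may not hold for your files, and this is not a description of what check_input does internally.

```r
# Assumes Biostrings and rtracklayer (Bioconductor) are installed.
library(Biostrings)
library(rtracklayer)

# Made-up FASTA file: one header carries a description, one ID is not in the GFF.
fasta_file <- tempfile(fileext = ".fa")
writeLines(
    c(">gene1 hypothetical protein", "MKTAYIAKQR", ">gene3", "MSDNELLKQA"),
    fasta_file
)

# Made-up GFF3 file with gene1 and gene2.
gff_file <- tempfile(fileext = ".gff3")
writeLines(c(
    "##gff-version 3",
    "chr1\texample\tgene\t100\t400\t.\t+\t.\tID=gene1",
    "chr1\texample\tgene\t900\t1500\t.\t-\t.\tID=gene2"
), gff_file)

proteins <- readAAStringSet(fasta_file)
annotation <- import(gff_file)

# names() returns the full header line, so drop everything after the first space.
seq_ids <- sub("\\s.*$", "", names(proteins))

# Assumption: gene IDs are stored in the ID attribute of the GFF file.
gene_ids <- annotation$ID

# IDs present in one file but absent from the other.
missing_in_gff <- setdiff(seq_ids, gene_ids)
missing_in_fasta <- setdiff(gene_ids, seq_ids)
```

Printing `missing_in_gff` and `missing_in_fasta` for real data gives a short, shareable summary of which names differ, which is also a good starting point for a reproducible example.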

almeidasilvaf commented 1 year ago

Hi.

Please do not ask for help in someone else's closed issue. Opening your own question is part of the etiquette of asking for help online.

If you have a problem, you have 2 options:

If it's a question on how to use the package, ask a question on the Bioconductor support site with the tag "#syntenet" on it, so I will get an automatic email notification and be happy to help you with your problem. This way, other people facing a similar problem can see your question and find it useful. Remember to also include a reproducible example showing what you tried, the error message you got, and what your data looks like.
If it's a bug (something does not work as it should), open a new issue in this GitHub repo. Again, always include a reproducible example.

One cannot simply include links to hundreds of MB of files and ask others to download them and inspect whatever error there is.

Thanks for understanding, Fabricio