file structure for no_pred

chmccarthy / Pangloss

Pangenome analysis of eukaryotes.

GNU General Public License v3.0


file structure for no_pred #4

Closed: fungal-spore closed this issue 4 years ago

fungal-spore commented 4 years ago

Hello, I have the Pangloss pipeline set up, and it ran successfully on the test data provided.

We used an alternative gene prediction pipeline, so we already have nucleotide and amino acid sequences for our isolates. We will use the --no_pred argument, but it is not clear what directory structure, files, and file locations are needed for this to work properly. We would still like to run all the other arguments (e.g. blastall, panoct etc.). Can you clarify how to arrange the files for this? Thanks!

fungal-spore commented 4 years ago

I figured it out. For future reference, you need the .faa and .attributes files in ./gm_pred/sets, and then you can run with --no_pred.

I had to look up the requirements for .attributes files in the PanOCT paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3526259/) and write a script to convert the .gff3 files from my gene prediction into the format that was needed.
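For anyone facing the same conversion, here is a minimal Python sketch of the GFF3-to-attributes step described above. It is not the script used in this thread and is not part of Pangloss. It assumes the PanOCT attributes layout is six tab-separated columns (contig ID, protein ID, 5' coordinate, 3' coordinate, annotation, genome tag), with the two coordinates swapped for minus-strand genes; please check that against the PanOCT documentation before relying on it. The function names, the choice of mRNA features, and the use of the GFF3 ID attribute as the protein ID are all hypothetical choices for illustration, and the IDs must match the headers in your .faa file.

```python
import sys


def parse_gff3_attributes(field):
    """Parse the ninth GFF3 column (key=value pairs separated by ';') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff3_to_attribute_rows(gff_lines, genome_tag, feature_type="mRNA"):
    """Return one six-column row per feature of the chosen type.

    Assumed column order (verify against the PanOCT docs):
    contig ID, protein ID, 5' coordinate, 3' coordinate, annotation, genome tag.
    """
    rows = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section ends the feature table
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = parse_gff3_attributes(cols[8])
        protein_id = attrs.get("ID")
        if protein_id is None:
            continue  # nothing to link this feature to a protein sequence
        annotation = attrs.get("product", attrs.get("Name", "hypothetical protein"))
        # GFF3 always has start <= end, so swap for minus-strand features.
        end5, end3 = (end, start) if strand == "-" else (start, end)
        rows.append((seqid, protein_id, str(end5), str(end3), annotation, genome_tag))
    return rows


def write_attributes(rows, handle):
    """Write rows as tab-separated lines to an open text handle."""
    for row in rows:
        handle.write("\t".join(row) + "\n")


# Made-up example records, for illustration only.
DEMO_GFF = "\n".join([
    "##gff-version 3",
    "\t".join(["ctg1", "pred", "gene", "100", "400", ".", "+", ".", "ID=g1"]),
    "\t".join(["ctg1", "pred", "mRNA", "100", "400", ".", "+", ".", "ID=g1.t1;product=kinase"]),
    "\t".join(["ctg1", "pred", "mRNA", "500", "900", ".", "-", ".", "ID=g2.t1"]),
])

if __name__ == "__main__":
    write_attributes(gff3_to_attribute_rows(DEMO_GFF.splitlines(), "strainA"), sys.stdout)
```

In real use you would pass an open .gff3 file handle instead of DEMO_GFF and write one .attributes file per isolate. If your gene predictor puts the protein ID under a different feature type or attribute key, adjust feature_type and the attrs.get("ID") lookup accordingly.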

chmccarthy commented 4 years ago

Hi fungal-spore,

Thanks for pointing this out, there's definitely more info I need to provide in the manual (and maybe in the README) about certain usage situations.

For the moment, if users want to use protein and location data from other sources like, say, NCBI, they'll need to write their own script to convert GFF or GTF files into PanOCT-compatible attributes files. In the future I might look into whether something like gffutils could make this aspect of data import easier. My past experiences with parsing GFF files in Python were... iffy, to put it mildly.

Going to pin this issue just to give people a heads-up in the meantime.

fungal-spore commented 4 years ago

I forgot to mention that you also need the *.nucl file in ./gm_pred/sets, and you need genomes.fna and genome.txt in ./genomes as well.
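Putting the comments in this thread together, a quick pre-flight check before running with --no_pred could look like the Python sketch below. It only covers ./gm_pred/sets and assumes that each isolate's .faa, .attributes and .nucl files share one base name (for example strainA.faa, strainA.attributes, strainA.nucl); that naming convention is an assumption, not something confirmed here. The function name missing_set_files is hypothetical and is not part of Pangloss.

```python
from pathlib import Path

# File types this thread says are needed in ./gm_pred/sets for --no_pred.
REQUIRED_SUFFIXES = (".faa", ".attributes", ".nucl")


def missing_set_files(workdir):
    """List the files missing from <workdir>/gm_pred/sets.

    Assumes (unverified) that the three files for one isolate share a base name.
    Returns the sets directory itself if it does not exist.
    """
    sets_dir = Path(workdir) / "gm_pred" / "sets"
    if not sets_dir.is_dir():
        return [str(sets_dir)]
    # Collect every isolate base name seen with at least one required suffix.
    stems = sorted({p.stem for p in sets_dir.iterdir() if p.suffix in REQUIRED_SUFFIXES})
    missing = []
    for stem in stems:
        for suffix in REQUIRED_SUFFIXES:
            candidate = sets_dir / (stem + suffix)
            if not candidate.is_file():
                missing.append(str(candidate))
    return missing


if __name__ == "__main__":
    for path in missing_set_files("."):
        print("missing:", path)
```

Run it from the Pangloss working directory; no output means every isolate found in ./gm_pred/sets has all three files. The contents of ./genomes are not checked, since the exact file naming there is not spelled out in this thread.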