Issue running GeMoMa on CCr

Ozzborne commented 9 months ago

Hi Dan, I am also having issues getting GeMoMa to run on CCR. Below is the error message I am getting.

Thank you

Exception in thread "main" de.jstacs.parameters.SimpleParameter$IllegalValueException: Error in parameter(ID): Parameter not permitted: value not valid: GCF_002021735.2_Okis_V2_genomic String does not match \w
    at de.jstacs.parameters.SimpleParameter.setValue(SimpleParameter.java:422)
    at de.jstacs.tools.ui.cli.CLI.setValue(CLI.java:606)
    at de.jstacs.tools.ui.cli.CLI.set(CLI.java:568)
    at de.jstacs.tools.ui.cli.CLI.set(CLI.java:564)
    at de.jstacs.tools.ui.cli.CLI.set(CLI.java:582)
    at de.jstacs.tools.ui.cli.CLI.setToolParameters(CLI.java:502)
    at de.jstacs.tools.ui.cli.CLI.run(CLI.java:404)
    at projects.gemoma.GeMoMa.main(GeMoMa.java:399)
Exception in thread "main" de.jstacs.parameters.SimpleParameter$IllegalValueException: Error in parameter(annotation): Parameter not permitted: File GeMoMa_combined/final_annotation.gff does not exist
    at de.jstacs.parameters.FileParameter.setValue(FileParameter.java:305)
    at de.jstacs.tools.ui.cli.CLI.setValue(CLI.java:606)
    at de.jstacs.tools.ui.cli.CLI.set(CLI.java:568)
    at de.jstacs.tools.ui.cli.CLI.setToolParameters(CLI.java:502)
    at de.jstacs.tools.ui.cli.CLI.run(CLI.java:404)
    at projects.gemoma.GeMoMa.main(GeMoMa.java:399)
mv: cannot stat 'GeMoMa_combined/proteins.fasta': No such file or directory
Parameters of tool "Extractor" (Extractor, version: 1.9):
a - annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) = GeMoMa_combined/final_annotation.longest_iso>
g - genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz) = /projects/academic/tkrabben/Osborne/LT_annotations/SE08//Genome/SE08_Flye_Medaka_Pilon3_Purge_HiC_RagTag_Gap>
gc - genetic code (optional user-specified genetic code, type = tabular, OPTIONAL) = null
p - proteins (whether the complete proteins sequences should returned as output, default = false) = true
c - cds (whether the complete CDSs should returned as output, default = false) = false
genomic - genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false) = false
i - introns (whether introns should be extracted from annotation, that might be used for test cases, default = false) = false
identical - identical (if CDS is identical Extractor only used one transcript. This parameter allows to return a table that lists in the first column the used transcript and in the second column the discarded trans>
u - upcase IDs (whether the IDs in the GFF should be upcased, default = false) = false
r - repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false) = false
s - selected (The path to list file, which allows to make only a predictions for the contained transcript ids. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ig>
Ambiguity - Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the correspo>
d - discard pre-mature stop (if true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true) = true
sefc - stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false) = false
f - full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true) = true
l - long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false) >
v - verbose (A flag which allows to output a wealth of additional information, default = false) = false
outdir - The output directory, defaults to the current working directory (.) = GeMoMa_combined
mv: cannot stat 'GeMoMa_combined/proteins_1.fasta': No such file or directory

dmacguigan commented 9 months ago

I believe this error occurs because your protein evidence is not in the format expected by the pipeline. From the config file:

## GEMOMA_REFS is the full path to a directory containing GFF and genome FASTA files for reference species
## files must end in .gff or .fasta
## file prefixes for GFF and FASTA must match, 
## file prefixes must be comprised only of letters and numbers, no special characters
## if you ran STEP 5 of this pipeline, GEMOMA_REFS can be the same directory as NCBI_DOWNLOAD_DIR
GEMOMA_REFS="/projects/academic/tkrabben/MacGuigan/genome_annotations/Lpel/GeMoMa_refs"

Looks like your file prefix GCF_002021735.2_Okis_V2_genomic has several special characters (underscores and periods).

If you run step 5 of the pipeline, it will download and rename protein evidence from NCBI. I'd recommend going this route if all your GeMoMa evidence is coming from NCBI.
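
If you'd rather fix existing files by hand, here is a rough sketch of how the prefixes could be cleaned up. It is not part of the pipeline: `sanitize_prefix` and `preview_renames` are hypothetical helper names, and the sketch assumes every file in the directory ends in `.gff` or `.fasta`, as the config comment requires. It only prints the `mv` commands it would run, so nothing is renamed until you have reviewed the output.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the pipeline.

# Strip every character that is not an ASCII letter or digit from a file prefix.
sanitize_prefix() {
    printf '%s' "$1" | LC_ALL=C tr -cd '[:alnum:]'
}

# Dry run: print one mv command for each .gff/.fasta file whose prefix
# contains special characters. Files are NOT renamed.
# The function body runs in a subshell so its variables stay local.
preview_renames() (
    dir="$1"
    for f in "$dir"/*.gff "$dir"/*.fasta; do
        [ -e "$f" ] || continue          # skip unmatched glob patterns
        base="${f##*/}"                  # file name without the directory
        ext="${base##*.}"                # text after the last dot (gff or fasta)
        prefix="${base%.*}"              # text before the last dot
        clean="$(sanitize_prefix "$prefix")"
        if [ "$clean" != "$prefix" ]; then
            printf 'mv %s %s\n' "$f" "$dir/$clean.$ext"
        fi
    done
)
```

For example, `preview_renames "$GEMOMA_REFS"` would list the proposed renames for the reference directory. A GFF and a FASTA that share a prefix receive the same cleaned prefix, so they still match afterwards. Two different prefixes could collapse to the same cleaned name, though, so check the printed commands for collisions before running any of them.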

Ozzborne commented 9 months ago

Ah, see, I knew it was a silly mistake on my part! I removed the special characters and it is working perfectly. Thank you!

KrabbenhoftLab / genome_annotation_pipeline

Issue running GeMoMa on CCr #3