mpoelchau closed this issue 1 year ago
@amcooksey - could you run the functional annotation pipeline for Saccharomyces eubayanus GCF_001298625.1, so we have some smaller test data for Tina to work with?
@ZhiXuanLai here are paths to an example GO and KEGG file on Ceres:
GO: /project/nal_genomics/amanda.cooksey/protein_sets/Neodiprion_pinetum/NCBI\ Annotation\ Release\ 100\ functional\ annotation/GCF_021155775.1_complete.gaf.tsv
KEGG: /project/nal_genomics/amanda.cooksey/protein_sets/Neodiprion_pinetum/NCBI\ Annotation\ Release\ 100\ functional\ annotation/KOBAS/GCF_021155775.1_KOBAS_acc_pathways.tsv
The functional annotation for Saccharomyces eubayanus is on Ceres: /project/nal_genomics/amanda.cooksey/protein_sets/Saccharomyces_eubayanus/NCBI Annotation Release 100 functional annotation
Update writeLastLine-genePred.cwl from:
valueFrom: "echo -e '\nThe file [file] was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for all operations within the i5k Workspace.' >> readme.txt"
to the following (replacing the file names in brackets with the original and processed gff file name inputs):
valueFrom: "echo -e '\nThe file [original-file-name] was post-processed to add functional annotations from the AgBase functional annotation pipeline (https://github.com/agbase). The resulting file is: [processed-file-name]. This file was used for all operations within the i5k Workspace.' >> readme.txt"
@ZhiXuanLai can we include both the original gff file and the processed gff file in the dispatch output?
@ZhiXuanLai when I run the workflow using NA for the url_table_file parameter, I get the following error:
INFO [workflow md5checksums] starting step gunzip_table
INFO [step gunzip_table] start
ERROR Exception on step 'gunzip_table'
ERROR [step gunzip_table] Cannot make job: Invalid job input record:
pipeline/flow_md5checksums/gunzip_single.cwl:21:3: Missing required input parameter 'in_gz'
INFO [workflow md5checksums] completed permanentFail
WARNING [step md5checksums] completed permanentFail
INFO [workflow ] completed permanentFail
{}
WARNING Final process status is permanentFail
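The log above shows gunzip_table being started even though only the NA placeholder was supplied. As a minimal sketch of the guard this implies, written in Python purely for illustration: the function names table_urls and should_run_gunzip_table are hypothetical and not part of the pipeline, and treating NA as "no table file" is an assumption based on this thread.

```python
from typing import List, Optional


def table_urls(url_table_file: Optional[List[str]]) -> List[str]:
    """Return the real table URLs, dropping the 'NA' placeholder (assumed sentinel)."""
    if not url_table_file:
        return []
    return [u for u in url_table_file if u.strip().upper() != "NA"]


def should_run_gunzip_table(url_table_file: Optional[List[str]]) -> bool:
    """The gunzip step only makes sense when at least one real URL remains."""
    return len(table_urls(url_table_file)) > 0
```

In a CWL v1.2 workflow the equivalent check would normally live in a `when` expression on the step; whether that matches the fix that was actually applied is not stated here.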
Hi @mpoelchau
(1) I updated the filenames in writeLastLine-genePred.cwl. I wonder if the filenames in writeLastLine.cwl need to be changed too. The current content is: "echo -e '\nThe file [file] was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for all operations within the i5k Workspace.' >> readme.txt"
(2) Sure! I added the two gff files to dispatch output.
(3) My bad! I fixed the error now.
@ZhiXuanLai thanks for the updates! I get the following error when I try to run the pipeline with
url_table_file: [
NA
]
INFO [step gaps-or-not] start
INFO [job gaps-or-not] /tmp/83xizhb_$ perl \
-ne \
'print if /N/' \
id_deleted_file.txt > /tmp/83xizhb_/lines-contain-N.txt
INFO [job gaps-or-not] completed success
INFO [step gaps-or-not] completed success
INFO [workflow gaps_or_not] completed success
INFO [step gaps_or_not] completed success
INFO [workflow ] starting step add_annotation
INFO [step add_annotation] will be skipped
INFO [step add_annotation] completed skipped
INFO [workflow ] starting step apollo2_data_processing
INFO [step apollo2_data_processing] start
ERROR Exception on step 'apollo2_data_processing'
ERROR [step apollo2_data_processing] Cannot make job: Invalid job input record:
pipeline/flow_apollo2_data_processing/processing/workflow.cwl:15:3: Missing required input parameter 'in_gff'
INFO [workflow ] completed permanentFail
{}
WARNING Final process status is permanentFail
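The log above suggests that when add_annotation is skipped, nothing is passed on as in_gff to apollo2_data_processing. A minimal sketch of the fallback this calls for, in Python for illustration only: the function name pick_gff and both arguments are hypothetical and not taken from the pipeline.

```python
from typing import Optional


def pick_gff(annotated_gff: Optional[str], original_gff: str) -> str:
    """Use the annotated gff when the annotation step produced one,
    otherwise fall back to the original gff."""
    return annotated_gff if annotated_gff is not None else original_gff
```

CWL v1.2 offers `pickValue: first_non_null` on a step input with multiple sources for this pattern; whether that is how the workflow was eventually fixed is an assumption.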
@ZhiXuanLai when I run the program with the table file URL, it completes successfully. The readme file looks good. However, I don't see the unprocessed gff in the analyses directory:
apollo@apollo:~$ ls /app/data/other_species/saceub/SEUB3.0/scaffold/analyses/Saccharomyces_eubayanus_Annotation_Release_100/
GCF_001298625.1_SEUB3.0_cds_from_genomic.fna GCF_001298625.1_SEUB3.0_rna_from_genomic.fna readme.txt
GCF_001298625.1_SEUB3.0_genomic.annotated.gff GCF_001298625.1_SEUB3.0_translated_cds.faa
For the readme update, we won't need to change writeLastLine.cwl; that only pertains to the assembly readme, and that file remains unchanged. Good question though!
@mpoelchau Sorry for not fixing the error. I must have run the pipeline without saving the change in the YAML file.
I have a question regarding the filename in writeLastLine.cwl. I wonder what we should fill in for the [processed-file-name] field when no table file is provided (so there is no processed gff file).
Good question! Is it possible to leave that line unchanged?
Update on how to handle writeLastLine-genePred.cwl.
The file $(inputs.original_gff.basename) was post-processed to add functional annotations from the AgBase functional annotation pipeline (https://github.com/agbase). The resulting file is: $(inputs.processed_gff.basename). This file was used for all operations within the i5k Workspace
The file was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for the JBrowse genome browser and the Apollo manual curation tool.
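One way to read the two messages above is that the first applies when a processed gff exists and the second when no table file was provided. That reading is an assumption, since the thread does not label them. Under that assumption, the choice could be sketched as follows; readme_last_line and its arguments are hypothetical names, and the real logic lives in the CWL tool.

```python
from typing import Optional

# Message used when functional annotations were added (wording from the thread).
ANNOTATED = (
    "The file {original} was post-processed to add functional annotations "
    "from the AgBase functional annotation pipeline (https://github.com/agbase). "
    "The resulting file is: {processed}. "
    "This file was used for all operations within the i5k Workspace"
)

# Placeholder message left for manual editing (wording from the thread).
PLACEHOLDER = (
    "The file was post-processed to [describe post-processing, if any]. "
    "The resulting file is: [Filename]. This file was used for the JBrowse "
    "genome browser and the Apollo manual curation tool."
)


def readme_last_line(original_gff: str, processed_gff: Optional[str]) -> str:
    """Pick the readme line depending on whether a processed gff exists."""
    if processed_gff is None:
        return PLACEHOLDER
    return ANNOTATED.format(original=original_gff, processed=processed_gff)
```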
We need to add another process that I forgot about when I described this issue. The functional annotation directory (name is now in the tree variable array) needs to be moved into the analyses directory during the dispatch workflow. We could add a sub-workflow similar to https://github.com/NAL-i5K/Organism_Onboarding/blob/master/flow_dispatch/2other_species/cp_dir.cwl.
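As a rough illustration of the directory step described above, here is a sketch that copies a directory into an analyses directory while keeping its name. It copies rather than moves, on the assumption that the step would mirror the linked cp_dir.cwl. The function name and arguments are hypothetical, and the actual implementation would be a CWL sub-workflow.

```python
import shutil
from pathlib import Path


def copy_annotation_dir(src_dir: str, analyses_dir: str) -> Path:
    """Copy src_dir (including its contents) into analyses_dir, keeping its name."""
    src = Path(src_dir)
    dest = Path(analyses_dir) / src.name
    # dirs_exist_ok (Python 3.8+) lets a re-run overwrite an earlier copy.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

Using pathlib here means directory names containing spaces, such as "NCBI Annotation Release 100 functional annotation", need no shell escaping.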
@childers could you take a look at the last comment/update?
We need to begin adding functional annotation information to the genome browsers. The most straightforward way to do this is via the annotation gff3 file, prior to creating the apollo/jbrowse files. That means we will change some of the first steps of final-workflow.cwl.
perl add_GO-KEGG_to_RefSeq-gff.pl GO-file Kegg-file GFF table-file > output.gff
The output file is named GFF.annotated.gff (e.g. GCF_001298625.1_SEUB3.0_genomic.annotated.gff).
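The naming convention above can be sketched as a small helper that inserts ".annotated" before the gff extension. The function name annotated_gff_name is hypothetical. The input name in the usage note is inferred from the example output and is an assumption.

```python
from pathlib import Path


def annotated_gff_name(gff_path: str) -> str:
    """Insert '.annotated' before the final extension of the gff file name."""
    p = Path(gff_path)
    return f"{p.stem}.annotated{p.suffix}"
```

For an input named GCF_001298625.1_SEUB3.0_genomic.gff this gives GCF_001298625.1_SEUB3.0_genomic.annotated.gff, which matches the example in the issue.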