NAL-i5K / Organism_Onboarding

A workflow to make organism onboarding pipeline easy to handle as an I/O pipeline

Update final-workflow.cwl to add functional annotations to gff file #141

Closed mpoelchau closed 1 year ago

mpoelchau commented 2 years ago

We need to begin adding functional annotation information to the genome browsers. The most straightforward way to do this is via the annotation gff3 file, prior to creating the apollo/jbrowse files. That means we will change some of the first steps of final-workflow.cwl.

  1. Download the following files:
    1. An NCBI table file (e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/298/625/GCF_001298625.1_SEUB3.0/GCF_001298625.1_SEUB3.0_feature_table.txt.gz; add URL to yml file?)
    2. An NCBI GFF file (e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/298/625/GCF_001298625.1_SEUB3.0/GCF_001298625.1_SEUB3.0_genomic.gff.gz; add URL to yml file?)
  2. Process the downloaded gff file with the following script: https://gitlab.com/i5k_Workspace/monicas-data-processing-scripts/-/blob/master/add_GO-KEGG_to_RefSeq-gff.pl (this script is new and needs to be pulled into your local copy of the existing monicas-data-processing-scripts repo)
  3. Script inputs:
    1. GO file (add input path to yml file)
    2. KEGG file (add input path to yml file)
    3. the downloaded GFF
    4. the downloaded table file
  4. Script is used as follows: perl add_GO-KEGG_to_RefSeq-gff.pl GO-file Kegg-file GFF table-file > output.gff
  5. Script output: processed gff file, file name should be GFF.annotated.gff (e.g. GCF_001298625.1_SEUB3.0_genomic.annotated.gff).
  6. Change the input for https://github.com/NAL-i5K/Organism_Onboarding/blob/master/flow_apollo2_data_processing/processing/workflow.cwl: in_gff is now the processed gff file
  7. The processed gff file should also be distributed in the flow_dispatch workflow
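
As a rough illustration of steps 4 and 5, the sketch below derives the processed file name from the NCBI GFF URL and assembles the perl command line. It is not taken from final-workflow.cwl; the helper names `annotated_gff_name` and `build_command` are hypothetical, and only the script name, argument order, and naming rule come from the list above.

```python
import posixpath
import shlex
from urllib.parse import urlparse


def annotated_gff_name(gff_url: str) -> str:
    """Derive the processed GFF name (step 5) from the NCBI GFF URL."""
    name = posixpath.basename(urlparse(gff_url).path)
    if name.endswith(".gz"):
        name = name[:-3]  # the downloaded file is gunzipped first
    stem, ext = posixpath.splitext(name)  # ext is ".gff"
    return f"{stem}.annotated{ext}"


def build_command(go_file, kegg_file, gff, table_file, out_gff):
    """Assemble the invocation from step 4 as a shell command string."""
    args = ["perl", "add_GO-KEGG_to_RefSeq-gff.pl",
            go_file, kegg_file, gff, table_file]
    # Quote each argument so paths containing spaces survive the shell.
    return " ".join(shlex.quote(a) for a in args) + " > " + shlex.quote(out_gff)


url = ("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/298/625/"
       "GCF_001298625.1_SEUB3.0/GCF_001298625.1_SEUB3.0_genomic.gff.gz")
print(annotated_gff_name(url))
```

Quoting matters here because the example GO and KEGG paths on Ceres contain spaces.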
mpoelchau commented 2 years ago

@amcooksey - could you run the functional annotation pipeline for Saccharomyces eubayanus GCF_001298625.1, so we have some smaller test data for Tina to work with?

mpoelchau commented 2 years ago

@ZhiXuanLai here are paths to an example GO and KEGG file on Ceres:
GO: /project/nal_genomics/amanda.cooksey/protein_sets/Neodiprion_pinetum/NCBI\ Annotation\ Release\ 100\ functional\ annotation/GCF_021155775.1_complete.gaf.tsv
KEGG: /project/nal_genomics/amanda.cooksey/protein_sets/Neodiprion_pinetum/NCBI\ Annotation\ Release\ 100\ functional\ annotation/KOBAS/GCF_021155775.1_KOBAS_acc_pathways.tsv

amcooksey commented 2 years ago

functional annotation for Saccharomyces eubayanus on CERES: /project/nal_genomics/amanda.cooksey/protein_sets/Saccharomyces_eubayanus/NCBI Annotation Release 100 functional annotation

mpoelchau commented 2 years ago

Update writeLastLine-genePred.cwl:

to (replace the file names in brackets with the original and processed gff file name inputs):
valueFrom: "echo -e '\nThe file [original-file-name] was post-processed to add functional annotations from the AgBase functional annotation pipeline (https://github.com/agbase). The resulting file is: [processed-file-name]. This file was used for all operations within the i5k Workspace.' >> readme.txt"
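
To make the bracket substitution concrete, here is a minimal sketch that fills the two placeholders and appends the line to a readme. It only mirrors the wording of the valueFrom string above; the function names `readme_note` and `append_note` are hypothetical, and the actual workflow does this with `echo` inside CWL rather than Python.

```python
def readme_note(original_name: str, processed_name: str) -> str:
    """Return the readme line with both file names substituted."""
    return (
        f"\nThe file {original_name} was post-processed to add functional "
        "annotations from the AgBase functional annotation pipeline "
        f"(https://github.com/agbase). The resulting file is: {processed_name}. "
        "This file was used for all operations within the i5k Workspace."
    )


def append_note(readme_path: str, original_name: str, processed_name: str) -> None:
    """Append the note to the readme, like `>> readme.txt` in the echo command."""
    with open(readme_path, "a", encoding="utf-8") as handle:
        handle.write(readme_note(original_name, processed_name) + "\n")


note = readme_note("GCF_001298625.1_SEUB3.0_genomic.gff",
                   "GCF_001298625.1_SEUB3.0_genomic.annotated.gff")
print(note)
```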

mpoelchau commented 2 years ago

@ZhiXuanLai can we include both the original gff file and the processed gff file in the dispatch output?

mpoelchau commented 2 years ago

@ZhiXuanLai when I run the workflow using NA for the url_table_file parameter, I get the following error:

INFO [workflow md5checksums] starting step gunzip_table
INFO [step gunzip_table] start
ERROR Exception on step 'gunzip_table'
ERROR [step gunzip_table] Cannot make job: Invalid job input record:
pipeline/flow_md5checksums/gunzip_single.cwl:21:3: Missing required input parameter 'in_gz'
INFO [workflow md5checksums] completed permanentFail
WARNING [step md5checksums] completed permanentFail
INFO [workflow ] completed permanentFail
{}
WARNING Final process status is permanentFail
ZhiXuanLai commented 2 years ago

Hi @mpoelchau

(1) I updated the filenames in writeLastLine-genePred.cwl. I wonder if the filenames in writeLastLine.cwl need to be changed too. The current content is: "echo -e '\nThe file [file] was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for all operations within the i5k Workspace.' >> readme.txt"

(2) Sure! I added the two gff files to dispatch output.

(3) My bad! I fixed the error now.

mpoelchau commented 2 years ago

@ZhiXuanLai thanks for the updates! I get the following error when I try to run the pipeline with

url_table_file: [
NA
]
INFO [step gaps-or-not] start
INFO [job gaps-or-not] /tmp/83xizhb_$ perl \
    -ne \
    'print if /N/' \
    id_deleted_file.txt > /tmp/83xizhb_/lines-contain-N.txt
INFO [job gaps-or-not] completed success
INFO [step gaps-or-not] completed success
INFO [workflow gaps_or_not] completed success
INFO [step gaps_or_not] completed success
INFO [workflow ] starting step add_annotation
INFO [step add_annotation] will be skipped
INFO [step add_annotation] completed skipped
INFO [workflow ] starting step apollo2_data_processing
INFO [step apollo2_data_processing] start
ERROR Exception on step 'apollo2_data_processing'
ERROR [step apollo2_data_processing] Cannot make job: Invalid job input record:
pipeline/flow_apollo2_data_processing/processing/workflow.cwl:15:3: Missing required input parameter 'in_gff'
INFO [workflow ] completed permanentFail
{}
WARNING Final process status is permanentFail
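
The log above shows add_annotation being skipped and apollo2_data_processing then failing because in_gff has no value. One way to reason about the intended behaviour is a fallback: use the annotated gff when it exists, otherwise the original. The sketch below models that selection only; `is_missing` and `select_in_gff` are hypothetical names, and treating the string NA as "not provided" is an assumption based on the yml value shown above. In CWL v1.2 this kind of fallback is usually expressed with `pickValue: first_non_null` on the step input; the thread does not say how the repository resolved it.

```python
from typing import Optional


def is_missing(value: Optional[str]) -> bool:
    """Treat None, an empty string, or the placeholder NA as 'not provided'."""
    return value is None or value.strip() in ("", "NA")


def select_in_gff(original_gff: str, annotated_gff: Optional[str]) -> str:
    """Pick the gff to pass on as in_gff: annotated if present, else original."""
    return original_gff if is_missing(annotated_gff) else annotated_gff


# Skipped add_annotation step: no annotated file, so fall back to the original.
print(select_in_gff("genomic.gff", None))
# Completed add_annotation step: the annotated file wins.
print(select_in_gff("genomic.gff", "genomic.annotated.gff"))
```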
mpoelchau commented 2 years ago

@ZhiXuanLai when I run the program with the table file URL, it completes successfully. The readme file looks good. However, I don't see the unprocessed gff in the analyses directory:

apollo@apollo:~$ ls /app/data/other_species/saceub/SEUB3.0/scaffold/analyses/Saccharomyces_eubayanus_Annotation_Release_100/
GCF_001298625.1_SEUB3.0_cds_from_genomic.fna   GCF_001298625.1_SEUB3.0_rna_from_genomic.fna  readme.txt
GCF_001298625.1_SEUB3.0_genomic.annotated.gff  GCF_001298625.1_SEUB3.0_translated_cds.faa
mpoelchau commented 2 years ago

For the readme update, we won't need to change writeLastLine.cwl - that only pertains to the assembly readme, and that file remains unchanged. Good question though!

ZhiXuanLai commented 2 years ago

@mpoelchau Sorry for not fixing the error. I must have run the pipeline without saving the change in the yaml file. I have a question regarding the filename in writeLastLine-genePred.cwl: what should we fill in for the [processed-file-name] field when no table file is provided (and so there is no processed gff file)?

mpoelchau commented 2 years ago

Good question! Is it possible to leave that line unchanged?

mpoelchau commented 2 years ago

Update on how to handle writeLastLine-genePred.cwl.

mpoelchau commented 2 years ago

We need to add another process that I forgot about when I described this issue. The functional annotation directory (name is now in the tree variable array) needs to be moved into the analyses directory during the dispatch workflow. We could add a sub-workflow similar to https://github.com/NAL-i5K/Organism_Onboarding/blob/master/flow_dispatch/2other_species/cp_dir.cwl.
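
For illustration, the directory step could look roughly like the sketch below. It copies rather than moves, on the assumption that the new sub-workflow would behave like the linked cp_dir.cwl; the function name `copy_into_analyses` and the example paths are hypothetical.

```python
import shutil
from pathlib import Path


def copy_into_analyses(func_annot_dir: str, analyses_dir: str) -> Path:
    """Copy the functional annotation directory into the analyses directory.

    The directory keeps its own name under analyses_dir. Requires
    Python 3.8+ for dirs_exist_ok, which allows re-running the dispatch.
    """
    src = Path(func_annot_dir)
    dest = Path(analyses_dir) / src.name
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


# Example call with placeholder paths (not taken from the workflow):
# copy_into_analyses("work/functional_annotation", "dispatch/analyses")
```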

mpoelchau commented 2 years ago

@childers could you take a look at the last comment/update?