unable to create promotor annotation for a non-model organism

ShenTTT commented 1 month ago

Hi, I am trying to create the promotor annotation for a non-model butterfly species, using a gtf file from NCBI, I am not sure how to specify the species so here I used drosophila:

promoterAnnotation <- preparePromoterAnnotation(file = "genomic.gtf",species = "Drosophila_melanogaster")

Then I got:

Parsing input file...
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Extract exons by transcripts...
Identify overlapping first exons for each gene...
Prepare mapping between transcripts, tss, promoters and genes...
Error in `[[<-`(`*tmp*`, name, value = "tss.") : 
  1 elements in value to replace 0 elements
In addition: Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of
  type stop_codon. This information was ignored.
2: There were 2 warnings in `dplyr::summarise()`.
The first warning was:
ℹ In argument: `start = min(.data$start)`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning. 
> dplyr::last_dplyr_warnings()
[[1]]
<warning/rlang_warning>
Warning in `dplyr::summarise()`:
ℹ In argument: `start = min(.data$start)`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
---
Backtrace:
    ▆
 1. ├─proActiv::preparePromoterAnnotation(file = "genomic.gtf", species = "Drosophila_melanogaster")
 2. │ └─proActiv:::getReducedExonRanges(exonRanges.firstExon, exonRanges.firstExon.geneId)
 3. │   └─... %>% ...
 4. ├─dplyr::select(...)
 5. ├─dplyr::ungroup(.)
 6. ├─dplyr::summarise(...)
 7. └─dplyr:::summarise.grouped_df(...)

[[2]]
<warning/rlang_warning>
Warning in `dplyr::summarise()`:
ℹ In argument: `end = max(.data$end)`.
Caused by warning in `max()`:
! no non-missing arguments to max; returning -Inf
---
Backtrace:
    ▆
 1. ├─proActiv::preparePromoterAnnotation(file = "genomic.gtf", species = "Drosophila_melanogaster")
 2. │ └─proActiv:::getReducedExonRanges(exonRanges.firstExon, exonRanges.firstExon.geneId)
 3. │   └─... %>% ...
 4. ├─dplyr::select(...)
 5. ├─dplyr::ungroup(.)
 6. ├─dplyr::summarise(...)
 7. └─dplyr:::summarise.grouped_df(...)

I wonder if the tool does support non-model organisms?

jleechung commented 1 month ago

Hi @ShenTTT, thanks for your interest in proActiv. For non-model organisms, building the transcript database may require some preprocessing of the annotations. Could you try filtering as described here https://github.com/GoekeLab/proActiv/issues/48#issuecomment-1826398911?

ShenTTT commented 1 month ago

Hi @jleechung I am not able to do that since the annotation is not in the 'chr' format, here's my gtf:

#gtf-version 2.2
#!genome-build ilBicAnyn1.1
#!genome-build-accession NCBI_Assembly:GCF_947172395.1
#!annotation-source NCBI RefSeq GCF_947172395.1-RS_2022_12
NC_069083.1 Gnomon  gene    36407   42389   .   +   .   gene_id "LOC128198437"; transcript_id ""; db_xref "GeneID:128198437"; description "DNA repair protein complementing XP-A cells homolog"; gbkey "Gene"; gene "LOC128198437"; gene_biotype "protein_coding"; 
NC_069083.1 Gnomon  transcript  36407   42389   .   +   .   gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gbkey "mRNA"; gene "LOC128198437"; model_evidence "Supporting evidence includes similarity to: 8 Proteins"; product "DNA repair protein complementing XP-A cells homolog, transcript variant X1"; transcript_biotype "mRNA"; 
NC_069083.1 Gnomon  exon    36407   36651   .   +   .   gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gene "LOC128198437"; model_evidence "Supporting evidence includes similarity to: 8 Proteins"; product "DNA repair protein complementing XP-A cells homolog, transcript variant X1"; transcript_biotype "mRNA"; exon_number "1"; 
NC_069083.1 Gnomon  exon    38725   39076   .   +   .   gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gene "LOC128198437"; model_evidence "Supporting evidence includes similarity to: 8 Proteins"; product "DNA repair protein complementing XP-A cells homolog, transcript variant X1"; transcript_biotype "mRNA"; exon_number "2"; 
NC_069083.1 Gnomon  exon    39523   42389   .   +   .   gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gene "LOC128198437"; model_evidence "Supporting evidence includes similarity to: 8 Proteins"; product "DNA repair protein complementing XP-A cells homolog, transcript variant X1"; transcript_biotype "mRNA"; exon_number "3"; 
NC_069083.1 Gnomon  CDS 38733   39076   .   +   0   gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; gbkey "CDS"; gene "LOC128198437"; product "DNA repair protein complementing XP-A cells homolog"; protein_id "XP_052739905.1"; exon_number "2"; 
NC_069083.1 Gnomon  CDS 39523   39952   .   +   1   gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; gbkey "CDS"; gene "LOC128198437"; product "DNA repair protein complementing XP-A cells homolog"; protein_id "XP_052739905.1"; exon_number "3"; 
NC_069083.1 Gnomon  start_codon 38733   38735   .   +   0   gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; gbkey "CDS"; gene "LOC128198437"; product "DNA repair protein complementing XP-A cells homolog"; protein_id "XP_052739905.1"; exon_number "2"; 
NC_069083.1 Gnomon  stop_codon  39953   39955   .   +   0   gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; gbkey "CDS"; gene

ShenTTT commented 1 month ago

Hi @jleechung , do you think I need to change all the chromosome accession ID to 'Chrx'something like that? And that has to be compatible with the model species I specified in the command line right?

Is there a reason why the ID has to be one of the model styles? Is there a way to skip this check?

Thanks

jleechung commented 1 month ago

Hi @ShenTTT, I've pushed some changes to allow more flexibility in species. For now, can you re-install proActiv from my fork:

remotes::install_github('jleechung/proActiv')

The preparePromoterAnnotation function now accepts an argument, seqLevels, which specifies which sequences to keep for downstream analysis. There's also no need to specify the species argument now. I'd recommend restricting it to the major chromosomes, since we haven't done much testing outside of that.

With these changes, creating annotations should now work. I've tried with the GTF downloaded from here:

txdb = makeTxDbFromGFF('bicAnyn.gtf') # path to gtf, change as required
chrs = sprintf('NC_069%03d.1', 83:110) # restrict to major chromosomes
anno = preparePromoterAnnotation(txdb = txdb, seqLevels = chrs)

I've also modified the downstream functions to allow more flexibility, but have not tested these yet. Are you using proActiv with junction files or bam files? Would be great if you could give it a test and let us know.

ShenTTT commented 1 month ago

Hi @jleechung , Thank you for the modifications, the annotation was generated successfully.

I am using bam files, I assume in that case I will need to forge a BSgenome data package right?

jleechung commented 1 month ago

Yes, I have not done this myself before but you can find more details here.

ShenTTT commented 1 month ago

Hi @jleechung, I tried the junction files generated by STAR alignment instead, and I can perform the analysis without any problems.

Thank you very much for your help. I will close this issue.

GoekeLab / proActiv

unable to create promotor annotation for a non-model organism #53