ShenTTT closed this issue 1 month ago
Hi @ShenTTT, thanks for your interest in proActiv. For non-model organisms, building the transcript database may require some preprocessing of the annotations. Could you try filtering as described here https://github.com/GoekeLab/proActiv/issues/48#issuecomment-1826398911?
Hi @jleechung, I am not able to do that because the annotation does not use 'chr'-style sequence names. Here's my GTF:
#gtf-version 2.2
#!genome-build ilBicAnyn1.1
#!genome-build-accession NCBI_Assembly:GCF_947172395.1
#!annotation-source NCBI RefSeq GCF_947172395.1-RS_2022_12
NC_069083.1 Gnomon gene 36407 42389 . + . gene_id "LOC128198437"; transcript_id ""; db_xref "GeneID:128198437"; description "DNA repair protein complementing XP-A cells homolog"; gbkey "Gene"; gene "LOC128198437"; gene_biotype "protein_coding";
NC_069083.1 Gnomon transcript 36407 42389 . + . gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gbkey "mRNA"; gene "LOC128198437"; model_evidence "Supporting evidence includes similarity to: 8 Proteins"; product "DNA repair protein complementing XP-A cells homolog, transcript variant X1"; transcript_biotype "mRNA";
NC_069083.1 Gnomon exon 36407 36651 . + . gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gene "LOC128198437"; model_evidence "Supporting evidence includes similarity to: 8 Proteins"; product "DNA repair protein complementing XP-A cells homolog, transcript variant X1"; transcript_biotype "mRNA"; exon_number "1";
NC_069083.1 Gnomon exon 38725 39076 . + . gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gene "LOC128198437"; model_evidence "Supporting evidence includes similarity to: 8 Proteins"; product "DNA repair protein complementing XP-A cells homolog, transcript variant X1"; transcript_biotype "mRNA"; exon_number "2";
NC_069083.1 Gnomon exon 39523 42389 . + . gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gene "LOC128198437"; model_evidence "Supporting evidence includes similarity to: 8 Proteins"; product "DNA repair protein complementing XP-A cells homolog, transcript variant X1"; transcript_biotype "mRNA"; exon_number "3";
NC_069083.1 Gnomon CDS 38733 39076 . + 0 gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; gbkey "CDS"; gene "LOC128198437"; product "DNA repair protein complementing XP-A cells homolog"; protein_id "XP_052739905.1"; exon_number "2";
NC_069083.1 Gnomon CDS 39523 39952 . + 1 gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; gbkey "CDS"; gene "LOC128198437"; product "DNA repair protein complementing XP-A cells homolog"; protein_id "XP_052739905.1"; exon_number "3";
NC_069083.1 Gnomon start_codon 38733 38735 . + 0 gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; gbkey "CDS"; gene "LOC128198437"; product "DNA repair protein complementing XP-A cells homolog"; protein_id "XP_052739905.1"; exon_number "2";
NC_069083.1 Gnomon stop_codon 39953 39955 . + 0 gene_id "LOC128198437"; transcript_id "XM_052883945.1"; db_xref "GeneID:128198437"; gbkey "CDS"; gene
Hi @jleechung, do you think I need to change all the chromosome accession IDs to something like 'Chrx'? And would they have to be compatible with the model species I specify on the command line?
Is there a reason why the ID has to be one of the model styles? Is there a way to skip this check?
Thanks
Hi @ShenTTT, I've pushed some changes to allow more flexibility in species. For now, can you re-install proActiv from my fork:
remotes::install_github('jleechung/proActiv')
The preparePromoterAnnotation function now accepts an argument, seqLevels, which specifies which sequences to keep for downstream analysis. There's also no need to specify the species argument now. I'd recommend restricting it to the major chromosomes, since we haven't done much testing outside of that.
With these changes, creating annotations should now work. I've tried with the GTF downloaded from here:
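library(GenomicFeatures) # provides makeTxDbFromGFF (newer Bioconductor releases provide it via txdbmaker)
library(proActiv)
txdb = makeTxDbFromGFF('bicAnyn.gtf') # path to gtf, change as required
chrs = sprintf('NC_069%03d.1', 83:110) # restrict to major chromosomes (NC_069083.1 to NC_069110.1)
anno = preparePromoterAnnotation(txdb = txdb, seqLevels = chrs)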
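Before passing seqLevels, it may be worth checking that the generated accession names actually occur in the GTF, and seeing which sequences would be dropped. Below is a minimal base-R sketch, assuming a tab-delimited GTF with the sequence name in the first column. The helper name gtf_seqnames, the two abbreviated example records, and the scaffold accession NW_000000000.1 are all hypothetical and only there for illustration; none of them come from proActiv.

```r
# Hypothetical helper (not part of proActiv): collect the distinct
# sequence names found in column 1 of a set of GTF lines.
gtf_seqnames <- function(lines) {
  records <- lines[!startsWith(lines, "#")]  # drop header/comment lines
  records <- records[nzchar(records)]        # drop empty lines
  unique(sub("\t.*$", "", records))          # keep the text before the first tab
}

# Abbreviated, illustrative records in the style of the GTF above.
# The NW_ scaffold accession is made up for this example.
example_lines <- c(
  "#gtf-version 2.2",
  "NC_069083.1\tGnomon\tgene\t36407\t42389\t.\t+\t.\tgene_id \"LOC128198437\";",
  "NW_000000000.1\tGnomon\tgene\t1\t100\t.\t+\t.\tgene_id \"example_scaffold_gene\";"
)

chrs  <- sprintf("NC_069%03d.1", 83:110)  # NC_069083.1 ... NC_069110.1
found <- gtf_seqnames(example_lines)

length(chrs)            # 28 candidate chromosome names
intersect(found, chrs)  # sequence names that would be kept
setdiff(found, chrs)    # sequence names that would be dropped
```

For a real annotation, the same check could be run with gtf_seqnames(readLines('bicAnyn.gtf')), adjusting the path as required.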
I've also modified the downstream functions to allow more flexibility, but have not tested these yet. Are you using proActiv with junction files or bam files? Would be great if you could give it a test and let us know.
Hi @jleechung, thank you for the modifications. The annotation was generated successfully.
I am using BAM files. I assume that in that case I will need to forge a BSgenome data package, right?
Yes. I have not done this myself before, but you can find more details here.
Hi @jleechung, I tried the junction files generated by STAR alignment instead, and I can perform the analysis without any problems.
Thank you very much for your help. I will close this issue.
Hi, I am trying to create the promoter annotation for a non-model butterfly species using a GTF file from NCBI. I am not sure how to specify the species, so I used Drosophila here:
promoterAnnotation <- preparePromoterAnnotation(file = "genomic.gtf", species = "Drosophila_melanogaster")
Then I got:
Does the tool support non-model organisms?