databio / GenomicDistributions

Calculate and plot distributions of genomic ranges
http://code.databio.org/GenomicDistributions
Other
25 stars 10 forks source link

Support for organisms such as Plasmodium Falciparum with gffs/gtfs that dont have 'gene_biotype' in the 9th column #190

Closed JesseRop closed 2 years ago

JesseRop commented 2 years ago

Dear developers, Thank you for the great tool which promotes good comprehension of the genome. I'm trying to use your tool with Pfalciparum gffs downloaded here ( https://plasmodb.org/common/downloads/release-52/Pfalciparum3D7/gff/data/PlasmoDB-52_Pfalciparum3D7.gff ) then converted using AGAT to gtf

agat_convert_sp_gff2gtf.pl -gff PlasmoDB-52_Pfalciparum3D7.gff -o PlasmoDB-52_Pfalciparum3D7.agat.gtf

NB : The gtf doesn't have the 'gene_biotype' in the 9th column I then get an error when reading trying to obtain TSS from it using getTssFromGTF.

getTssFromGTF(source='PlasmoDB-52_Pfalciparum3D7.agat.gtf', convertEnsemblUCSC=T)

Below is the error.

Got local file: PlasmoDB-52_Pfalciparum3D7.agat.gtf

Error: Problem with `filter()` input `..1`.
ℹ Input `..1` is `gene_biotype == "protein_coding"`.
✖ object 'gene_biotype' not found
Traceback:

1. getTssFromGTF(source = "PlasmoDB-52_Pfalciparum3D7.agat.gtf", 
 .     convertEnsemblUCSC = T)
2. GtfDf %>% dplyr::filter(gene_biotype == "protein_coding", type == 
 .     "gene")
3. dplyr::filter(., gene_biotype == "protein_coding", type == "gene")
4. filter.data.frame(., gene_biotype == "protein_coding", type == 
 .     "gene")
5. filter_rows(.data, ..., caller_env = caller_env())
6. withCallingHandlers({
 .     dots <- new_quosures(flatten(imap(dots, function(dot, index) {
 .         env_filter$current_expression <- index
 .         expand_if_across(dot)
 .     })))
 .     mask$eval_all_filter(dots, env_filter)
 . }, error = function(e) {
 .     local_call_step(dots = dots, .index = env_filter$current_expression, 
 .         .fn = "filter")
 .     abort(c(cnd_bullet_header(), i = cnd_bullet_input_info(), 
 .         x = conditionMessage(e), i = cnd_bullet_cur_group_label()), 
 .         class = "dplyr_error")
 . })
7. mask$eval_all_filter(dots, env_filter)
8. .handleSimpleError(function (e) 
 . {
 .     local_call_step(dots = dots, .index = env_filter$current_expression, 
 .         .fn = "filter")
 .     abort(c(cnd_bullet_header(), i = cnd_bullet_input_info(), 
 .         x = conditionMessage(e), i = cnd_bullet_cur_group_label()), 
 .         class = "dplyr_error")
 . }, "object 'gene_biotype' not found", base::quote(mask$eval_all_filter(dots, 
 .     env_filter)))
9. h(simpleError(msg, call))
10. abort(c(cnd_bullet_header(), i = cnd_bullet_input_info(), x = conditionMessage(e), 
  .     i = cnd_bullet_cur_group_label()), class = "dplyr_error")
11. signal_abort(cnd)
kkupkova commented 2 years ago

Hi! Thank you for your interest in GenomicDistributions.

I just made an update to the package, that should now enable you to use getTssFromGTF function on your GTF file. The function now takes an extra input parameter filterProteinCoding. If you set up filterProteinCoding=FALSE in your script, you should get your list of TSSs.

This functionality is available in version 1.3.4, which is now available on github (should be available on bioconductor development branch by tomorrow), so please make sure you install the lates version of GenomicDistributions. Let me know if you run into more issues or need help with installation.

JesseRop commented 2 years ago

Many thanks for your prompt action on this Kristyna! getTssFromGTF is now working as expected! Is it possible to apply the same fix for other functions that need GTF as an input such as getGeneModelsFromGTF .

kkupkova commented 2 years ago

Thank you for your feedback. I made the same changes to getGeneModelsFromGTF function in GenomicDistributions version 1.3.5. So if you again set filterProteinCoding=FALSE, the function should work for you!

JesseRop commented 2 years ago

Thanks for updating the function, however now I'm getting the error below (even when I use the examples in the tuitorial) ;

PfGeneModels = getGeneModelsFromGTF(source=agat.gtf, features=c("gene", "exon", "three_prime_utr", "five_prime_utr"), filterProteinCoding=FALSE)

Got local file: PlasmoDB-52_Pfalciparum3D7.agat.gtf

Extracting features: gene, exon, three_prime_utr, five_prime_utr

Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'seqlevels': GRanges objects don't support [[, as.list(), lapply(), or unlist() at
  the moment
Traceback:

1. getGeneModelsFromGTF(source = agat.gtf, features = c("gene", 
 .     "exon", "three_prime_utr", "five_prime_utr"), filterProteinCoding = FALSE)
2. unique(GenomeInfoDb::keepStandardChromosomes(reduce(GenomicRanges::makeGRangesFromDataFrame(subsetGtfDf %>% 
 .     filter(type == feat), keep.extra.columns = TRUE)), pruning.mode = "coarse"))
3. GenomeInfoDb::keepStandardChromosomes(reduce(GenomicRanges::makeGRangesFromDataFrame(subsetGtfDf %>% 
 .     filter(type == feat), keep.extra.columns = TRUE)), pruning.mode = "coarse")
4. standardChromosomes(x, species)
5. seqlevels(x)
6. reduce(GenomicRanges::makeGRangesFromDataFrame(subsetGtfDf %>% 
 .     filter(type == feat), keep.extra.columns = TRUE))
7. reduce_impl(.x, .f, ..., .init = .init, .dir = .dir)
8. reduce_init(.x, .init, left = left)
9. x[[1]]
10. x[[1]]
11. getListElement(x, i, ...)
12. getListElement(x, i, ...)
13. stop(wmsg(class(x), " objects don't support [[, as.list(), ", 
  .     "lapply(), or unlist() at the moment"))
14. .handleSimpleError(function (cond) 
  . .Internal(C_tryCatchHelper(addr, 1L, cond)), "GRanges objects don't support [[, as.list(), lapply(), or unlist() at\n  the moment", 
  .     base::quote(getListElement(x, i, ...)))
15. h(simpleError(msg, call))
kkupkova commented 2 years ago

Hi!

This is strange, are you saying you get this error even with the example dataset provided in GenomicDistributions? I just ran the example on my computer without any problem. Could you share with me your gtf file? (kristynakupkova@gmail.com) This way I could have a look what might be a potential problem. Also, could you share the output of sessionInfo()?