databio / pepatac

A modular, containerized pipeline for ATAC-seq data processing
http://pepatac.databio.org
BSD 2-Clause "Simplified" License
53 stars 14 forks source link

Help using pepatac with a non-reference genome #158

Closed josruirod closed 2 years ago

josruirod commented 3 years ago

Hi again, I'm trying to set up pepatac with a genome not present in ensembl. Pepatac is working great, but I'm need of some final help to get it fully working. I have use refgenie to build the recipes and this is the output of refgenie list

Local genomes: AgamP_VB Local recipes: bismark_bt1_index, bismark_bt2_index, blacklist, bowtie2_index, bwa_index, cellranger_reference, dbnsfp, dbsnp, ensembl_gtf, ensembl_rb, epilog_index, fasta, fasta_txome, feat_annotation, gencode_gtf, hisat2_index, kallisto_index, refgene_anno, salmon_index, salmon_partial_sa_index, salmon_sa_index, star_index, suffixerator_index, tallymer_index Local assets: AgamP_VB/ bowtie2_index:default, ensembl_gtf.ensembl_gene_body:default, ensembl_gtf.ensembl_tss:default, ensembl_gtf:default, fasta.chrom_sizes:default, fasta.fai:default, fasta:default, gencode_gtf:default, refgene_anno.refgene_exon:default, refgene_anno.refgene_intron:default, refgene_anno.refgene_pre_mRNA:default, refgene_anno.refgene_tss:default, refgene_anno:default, suffixerator_index.esa:default, tallymer_index.search_file:75, tallymer_index.tindex:75

I think I should have everything I need. However, in a run of pepatac I'm not getting the TSS enrichment plots and the annotation of the peaks. In the log I've found the corresponding lines, such as:

Skipping TSS -- TSS enrichment requires TSS annotation file: refgene_tss

Skipping read and peak annotation...

This requires a AgamP_VB annotation file.

But I'm not sure on what I'm missing. It's not crucial because I can do this externally, but I'd like to fully use pepatac.

Thanks for the great support!

jpsmith5 commented 3 years ago

Hey @josruirod, I see you have the refgene_anno file now as related to your past refgenie issue. I'm curious, did you end up manually adding the refgene_anno asset, or did you use the build recipe, i.e. refgenie build AgamP_VB/refgene_anno --files refgene=/path/to/your_refgene_file.txt.gz. I ask, because the build recipe should automatically generate that refgene_tss file which is a child asset of the refgene_anno asset. If you added it manually, it would not have produced this file. In order to produce it manually you could recreate the recipe command and produce the file manually.

gzip -dcf /path/to/your_refgene_file.txt.gz | awk '{{if($4=="+"){{print $3"\t"$5"\t"$5"\t"$13"\t.\t"$4}}else{{print $3"\t"$6"\t"$6"\t"$13"\t.\t"$4}}}}' | LC_COLLATE=C sort -k1,1 -k2,2n -u > refgene_tss.bed

Assuming you were able to recreate the refgene_anno file based on the file schema, the columns will line up correctly for the appropriate output file to be produced.

josruirod commented 3 years ago

Super, thanks. I did recreate the refgene txt file for my organism, and used refgenie build AgamP_VB/refgene_anno -c $REFGENIE --files refgene=$folder3/our_refgene_file.txt.gz. Why the TSS file was not created then?

Anyhow, I've been able to produce it with the line you provided, thanks. Where should I place it? Or should I give it to pepatac with --TSS-name?

And I'm curious why the peaks are not being annotated either. Is this also related to the lack of TSSs? I also built the assets gencode_gtf and ensembl_gtf with a custom gtf as you can see above with refgene list

Thanks

jpsmith5 commented 3 years ago

Do you happen to have the output from the build command still available? Could look to see if it threw any errors, because yes it should have been produced. And yes, you can just pass it with that command. Probably the simplest approach. Yeah, the peaks should be utilizing the feat_annotation asset too, which I see you have. If you want to share a sample's PEPATAC_log.md file here I can look over it and see if anything jumps out to me immediately. You'll need to change the extension to .txt to share it on GitHub. Could you also share a preview of your feat_annotation asset, and let's confirm it looks correct too.

So you could refgenie seek AgamP_VB/feat_annotation to retrieve the path to the asset. Then just zcat -dc <that_path> | head.

josruirod commented 3 years ago

HI, thanks for all the comment. I may have done something wrong imitating the format or there's not enough info on the available annotation, and I can proceed externally to pepatac, but I'd like to confirm.

So, the output of refgenie seek AgamP_VB/feat_annotation is an error because it's not finding it:

refgenconf.exceptions.MissingAssetError:` Genome 'AgamP_VB' exists, but asset 'feat_annotation' is missing

As you can see in my previous message, I have built AgamP_VB/refgene_anno, not AgamP_VB/feat_annotation. I have looked with refgenie build AgamP_VB/feat_annotation -c $REFGENIE --requirements and ensembl_gtf and ensembl_rb are required. I do have a gtf file but I'm not sure if I have an ensembl_rb. Could you point me to the differences or specs of those files?

I'm attaching the output from the refgenie build AgamP_VB/refgene_ann, the output from zcat -dc path/to/refgene_anno | head, and the log of the run (one failed first, but the other is almost complete except for that I'd say).

Thanks!

[PEPATAC_log_test_1120.txt] [head_refgene_anno.txt] [build_feat_anno_log.txt]

jpsmith5 commented 3 years ago

Hey, sorry I misread that. Yeah, I see from the log it successfully produced a TSS file. What does /home/ipbln/jlruiz/bin/pepatac/genomes/AgamP_VB/refgene_anno/default/AgamP_VB_TSS.bed look like if you head that file? And then concurrently, what does refgenie seek AgamP_VB/refgene_anno.refgene_tss return?

The ensembl_rb file is the Ensembl regulatory build file, which I believe is limited to human, mouse, and maybe drosophila. It's essentially a breakdown of known annotated genomic features in gff format. Here's a guide to manually building your own feat_annotation asset.

I also noted you had very low alignment rates (0.09) with this test sample. Is that expected with this sample and species?

josruirod commented 3 years ago

Thanks for the answer, when I head the TSS file it seems it's good:

AgamP4_2L 181305 181305 AGAP004677 . - AgamP4_2L 186936 186936 AGAP004677 . -

And the refgenie seek output is:

/home/ipbln/jlruiz/bin/pepatac/genomes/AgamP_VB/refgene_anno/default/AgamP_VB_TSS.bed

So it seems it's successfully created, but for some reason pepatac is not getting it?

Thanks for the guide for manually building the feat_annotation. I did that and provided it with --anno-name. What type of features should be in column 4? Maybe it's failing because I did not provided any "TSS" feature?

Regarding the low alignment, this was just some testing data. I'll try again with a fresh complete run and keep you posted

jpsmith5 commented 3 years ago

For the feat_annotation, column 4 could be really any sort of feature of interest. If you have exons and introns or maybe promoters and enhancers that would be the "name" of the feature that should be present in column 4. So you could concatenate a file of known promoters and enhancers that are in the format chr, start, end, "promoter/enhancer", ".", strand if that makes sense.

I believe the TSS_name issue you experienced was a bug. I just released a new pipeline version 0.9.9 to help there. If you do NOT provide a --TSS-name to the pipeline it should by default use your refgene_tss asset, which you confirmed exists and looks correct.

Okay, sounds good regarding the alignment. I just wanted to make sure there wasn't anything to note there from a pipeline perspective.

josruirod commented 3 years ago

Hi, I can confirm that the refgene_tss and the annotation of peaks, the enrichment at TSSs... seem to be working now. This is great.

Thanks for the support

jpsmith5 commented 3 years ago

👍