extract_from_sequences.py script run problem

summerxiajq commented 4 years ago

The output file test extracted by the extract_from_sequences.py script is empty, is it because the reference file I provided is wrong? The command :./src/extract_from_sequences.py -t /data1/zhoulab/xiajiaqi/reference/hg38/Homo_sapiens.GRCh38.99.gtf -a /data1/zhoulab/xiajiaqi/reference/hg38/Homo_sapiens.GRCh38.99.chr.gtf -g /data1/zhoulab/xiajiaqi/reference/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa -m aln4.bam -o test

cheny19 commented 4 years ago

Hi @summerxiajq ,

First of all, could you double-check how you generated aln4.bam? Is it aligned against the correct genome assembly?

Second, how did you get /data1/zhoulab/xiajiaqi/reference/hg38/Homo_sapiens.GRCh38.99.gtf this file? You should use awk '$3=="transcript"' /data1/zhoulab/xiajiaqi/reference/hg38/Homo_sapiens.GRCh38.99.chr.gtf to get only transcript annotations.

If both are fine, maybe it's better to send me your test file so I can take a closer look.

Thanks, Chen

cheny19 commented 4 years ago

Yes, if your RNA-Seq data is strand-specific, then you should use -stranded when running RNA-Bloom.

When you are aligning assembled transcripts to the reference, you should use reference genome instead of cdna fasta file. I thing this is the main reason why there's no output.

summerxiajq commented 4 years ago

Hello:

Excuse me.I got the output and integrated it with the site as follows: > 3.l.1_chr1_149530108_f 0.01623591.I'm not sure what the number 0.01623591 means.How likely is this site to be the ployA site, or the p-value?How should this threshold be?The article reads:Ntor Termin (A) takes the fixed length sequence as input and performs three - label classification to determine been the sequence contains A poly real site (A), A non - polyadenylated CS,From the result, how can I distinguish the three kinds of sites?Is the error of 30bp above and below this site acceptable?Is the fasta in the result file based on the expression quantity to get the sequence?Can the fasta only be in 3’utr region?Does this approach fully consider the PloyA site in the intron region?Sorry to disturb you again, because the following processing needs the above information, hope to get your help, wish you good luck.

------------------ 原始邮件 ------------------ 发件人: "Chen Yang"<notifications@github.com>; 发送时间: 2020年3月3日(星期二) 凌晨2:24 收件人: "bcgsc/Terminitor"<Terminitor@noreply.github.com>; 抄送: "夏佳琪"<1360983892@qq.com>;"Mention"<mention@noreply.github.com>; 主题: Re: [bcgsc/Terminitor] extract_from_sequences.py script run problem (#2)

Yes, if your RNA-Seq data is strand-specific, then you should use -stranded when running RNA-Bloom.

When you are aligning assembled transcripts to the reference, you should use reference genome instead of cdna fasta file. I thing this is the main reason why there's no output.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

cheny19 commented 4 years ago

The number is the probability of this specific site being a poly(A) site, normally the threshold is 0.5. Based on our experience, if the real site is ~30bp within the vicinity, the probability should still be pretty, but not necessary. The fasta file is purely based on assembly, which is not dependent on expression level. Therefore, it's not only in the 3'utr region either. If the terminal exon is short enough, the extracted sequence may contain more upstream regions as well. If the poly(A) site occurs in a intron region and it is expressed, which means that intron is used as terminal exon for that transcript, so yes, it will be considered in our program, because we only rely on the transcript reconstruction, not the intron/exon information.

Hope that explains your questions. Feel free to contact and I'm happy to help.

bcgsc / Terminitor

extract_from_sequences.py script run problem #2