Hi ZHUANG,
Thanks for your interest in our method, and sorry for the late response; I have just come back from the holidays. Regarding your first question, "Will it affect the output?": the effect should be minor. kallisto needs to know every base of a sequence in order to build the de Bruijn graph, so it replaces each unknown base ('N') with a pseudo-random base. Because the unknown bases are replaced with A, C, G, or T at random, the chance of reads aligning to them should be low, depending on the length or density of the unknown bases. Besides, the replacement is reproducible by fixing the random seed. So you don't need to worry too much about this warning message. Regarding your second question, we are looking into it and will get back to you later.
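To make the idea of a reproducible replacement concrete, here is a minimal Python sketch. It is not kallisto's or psirc-quant's actual code: the function name `replace_unknown_bases` and the default seed are hypothetical, chosen only for illustration. It shows that a fixed seed gives the same substituted sequence on every run.

```python
import random


def replace_unknown_bases(sequence: str, seed: int = 42) -> str:
    """Replace every unknown base ('N' or 'n') with a pseudo-random base.

    Hypothetical helper for illustration only; the seed value is an assumption.
    """
    # A local generator keeps the result independent of the global random state.
    rng = random.Random(seed)
    return "".join(
        rng.choice("ACGT") if base in ("N", "n") else base
        for base in sequence
    )


if __name__ == "__main__":
    original = "ACGTNNNNACGTNACGT"
    first = replace_unknown_bases(original)
    second = replace_unknown_bases(original)
    # Same seed, same input: the two results are identical.
    print(first == second)
```

Only the positions that held 'N' change; every known base is passed through untouched, which is why short or sparse runs of unknown bases should have little effect on quantification.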
Hi MH, sorry for getting back to you late. Do you still have the issue mentioned with create_custom_transcriptome_fa.pl? If so, can you send me the input files you used so I can try to re-create the error? Regards, Ken
Hi Ken,
Sorry for not replying in time. I used the following command (for GRCh38):
perl create_custom_transcriptome_fa.pl GRCh38.primary_assembly.genome.fa gencode.v29.annotation.gtf gencode.v29.GRch38.custom_transcriptome.fa
The reference genome FASTA file and the annotation GTF file are from: FASTA GTF
And I've attached the output file here. Looking forward to your reply. gencode.v29.GRch38.custom_transcriptome.fa.gz
By the way, is the provided "custom_transcriptome.fa" in question #1 generated from GRCh38?
Best regards,
MH ZHUANG.
Hi MH, I tested the files you provided and found the issue. When reading the reference sequence names, the create_custom_transcriptome_fa.pl script currently matches the whole line of each FASTA header. So if the header is chr1 1, it takes that whole string as the name instead of just chr1. You can fix it by commenting out the current line 89:
and replacing it with: my ($ref_sequence_name) = $header =~ /^>(\S+)/;
For your other question, yes, the default custom_transcriptome.fa is generated from GRCh38. Regards, Ken
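For readers who want to check their own FASTA headers, the effect of the replacement regex `/^>(\S+)/` can be sketched in Python as below. This is only an illustration of the same pattern, not part of create_custom_transcriptome_fa.pl, and the function name `reference_sequence_name` is hypothetical.

```python
import re
from typing import Optional

# Same pattern as the Perl fix: capture the first run of non-whitespace
# characters that follows the leading '>'.
HEADER_NAME = re.compile(r"^>(\S+)")


def reference_sequence_name(header: str) -> Optional[str]:
    """Return the sequence name from a FASTA header line, or None if it is not a header."""
    match = HEADER_NAME.match(header)
    return match.group(1) if match else None


if __name__ == "__main__":
    # A header of the form discussed above, with extra text after the name.
    print(reference_sequence_name(">chr1 1"))
```

Keeping only the first whitespace-delimited token means the name taken from the FASTA header is just chr1, which is the form the script needs to match.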
Hi Ken,
Thank you and Christina for your kindness. It really solved my problems. Now the create_custom_transcriptome_fa.pl script works perfectly.
Exploring the pipeline has also raised a few more questions that may not be covered by the title of this issue. Could we discuss them here? If needed, I can start a new issue.
1) Is it necessary to trim reads (for example, to remove adapters when FastQC reports warnings) before running the pipeline? This step is optional for some pipelines because trimming leaves reads of varying lengths. Also, some aligners take care of unmapped reads, including those containing adapters. But I'm not sure whether that applies to kallisto.
2) When I checked the results, I found that some circular FLIs with a valid “back_splice_junction_read_count” in part 1 had a TPM of 0 in part 2. I am not sure whether it is worthwhile to continue analyzing a data set that shows this pattern.
Thank you again for your help.
Best regards, MH ZHUANG.
Hi MH,
Regards, Ken
I am closing this issue because there has been no discussion for over a month. @gitzmh If you have further questions, you can still post here and I will re-open it.
Dear Dr. Christina Huan Shi,
Hi. Thank you for the great pipeline and detailed manuals.
I’ve successfully run the pipeline, but I have several questions.
First, there was a WARNING after loading full_length_isoforms.fa when I was running “psirc-quant index”:
Will it affect the output?
Besides, I'm curious whether the provided "custom_transcriptome.fa" is GRCh37 or GRCh38. So I downloaded both versions of the reference genome FASTA file and the annotation GTF file for Release 29, and tried to create it with create_custom_transcriptome_fa.pl. However, the output for each version was only 25+ MB. Something seemed wrong compared to the 300+ MB file provided. Then I checked the logs and found the following three sentences repeated extensively:
For reference, the input files were “GRCh38.primary_assembly.genome.fa” and “gencode.v29.annotation.gtf”, “GRCh37.primary_assembly.genome.fa” and “gencode.v29lift37.annotation.gtf”.
Could you please help me with these questions? I would greatly appreciate your help.
Best regards,
MH ZHUANG.