gitzmh commented 1 year ago

Dear Dr. Christina Huan Shi,

Hi. Thank you for the great pipeline and detailed manuals.

I’ ve successfully run the pipeline. But I have several questions.

First, here was a WARNING after loading the full_length_isoforms.fa when I was running “psirc-quant index“ :

[build] warning: replaced 206841 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides

Will it affect the output?

Besides, I'm curious whether the provided "custom_transcriptome.fa" is GRch37 or GRch38. So I downloaded both versions of reference genome FASTA file and annotation GTF file for Release 29, and tried to create with create_custom_transcriptome_fa.pl. However, output of each version was only 25+MB. Something seemed wrong compared to the 300+MB file provided. Then I checked the logs and found the following three sentences repeated extensively:

Use of uninitialized value in concatenation (.) or string at /home/ Software/psirc/create_custom_transcriptome_fa.pl line 73.
Use of uninitialized value in substr at /home/ Software/psirc/create_custom_transcriptome_fa.pl line 73.
substr outside of string at /home/Software/psirc/create_custom_transcriptome_fa.pl line 73.

For reference, the input files were “GRCh38.primary_assembly.genome.fa” and “gencode.v29.annotation.gtf”, “GRCh37.primary_assembly.genome.fa” and “gencode.v29lift37.annotation.gtf”.

Could you please help me with these questions? I would greatly appreciate your help.

Best regards,

MH ZHUANG.

Christina-hshi commented 1 year ago

Hi ZHUANG,

Thanks for your interest in our method. Sorry about the late response. Just come back from the holiday weeks. Regarding your first question, "will it affect the output?" My answer is that the effect will be minor. The Kallisto method needs to know every base of the sequence in order to build the de bruijn graph. Therefore, for these unknown bases ('N'), it will generate pseudo-random bases to replace them. Since the unknown sequences are replaced with A|C|T|G randomly, the chance of having reads aligned to them should be low depending on the length or density of the unknown bases. Besides, the replacement is reproducible by fixing the random seed. So you don't need to worry too much about this warning message. Regarding your second question, we are looking into it and will get back to you later.

hoyu310 commented 1 year ago

Hi MH, sorry for getting back to you late. Do you still have the issue mentioned with create_custom_transcriptome_fa.pl? If so, can you send me the input files you used so I can try to re-create the error? Regards, Ken

gitzmh commented 1 year ago

Hi Ken,

Sorry for not replying in time. I used the following command (for GRCh38): perl create_custom_transcriptome_fa.pl GRCh38.primary_assembly.genome.fa gencode.v29.annotation.gtf gencode.v29.GRch38.custom_transcriptome.fa

For the reference genome FASTA file and annotation GTF file, they are from: FASTA GTF

And I've attached the output file here. Looking forward to your reply. [Uploading gencode.v29.GRch38.custom_transcriptome.fa.gz…]() gencode.v29.GRch38.custom_transcriptome.fa.gz

By the way, is the provided "custom_transcriptome.fa" in question #1 generated from GRCh38？

Best regards,

MH ZHUANG.

hoyu310 commented 1 year ago

Hi MH, I tested the files you provided and found out the issue. The create_custom_transcriptome_fa.pl script currently matches whole lines of the reference sequence fasta headers when inputting the reference sequence names, therefore, if the header is chr1 1, it will match that instead of just chr1. You can fix it by commenting out the current line 89:

my ($ref_sequence_name) = $header =~ /^>(.*)/;

and replacing it with: my ($ref_sequence_name) = $header =~ /^>(\S+)/;

For your other question, yes, the default custom_transcriptome.fa is generated from GRCh38. Regards, Ken

gitzmh commented 1 year ago

Hi Ken,

Thank you and Christina for your kindness. It really solved my problems. Now the create_custom_transcriptome_fa.pl script works perfectly.

Besides, exploring the pipeline has raised a few more doubts that may not be covered by the title of this issue. Could we discuss the following questions here? If needed, I can start a new issue. 1) Is it necessary to do read trimming(for example, to remove the adapters when there are warnings from fastqc) before running the pipeline? This step is optional for some pipelines because it would end up with a range of lengths. Also, some aligners would take care of unmapped reads including adapters. But I'm not sure about that in kallisto. 2) When I checked the results, I found that some circular FLIs with certain valid “back_splice_junction_read_count” in part 1 had a TPM of 0 in part 2. I am not sure if it is valuable to continue analyzing using the set of data with this kind of appearance.

Thank you again for your help.

Best regards, MH ZHUANG.

hoyu310 commented 1 year ago

Hi MH,

It's optional, I think
This one is more tricky, do you find additional evidence that the BSJ read count for that FLI is confident? For example, you have multiple samples expressing that BSJ, or it can be identified in a circRNA database. You may also visualise those reads on a genome browser and decide whether the BSJ alignments look real. You can further look at the TPM of other FLIs for the same BSJ, see if they have 0 or non-zero TPM.

Regards, Ken

Christina-hshi commented 1 year ago

I am closing this issue because of no discussions for over 1 month. @gitzmh If you have further questions, you can still post here and I will re-open it.

Christina-hshi / psirc

Error to index the inferred FLI and Question of custom_transcriptome_fa #5

my ($ref_sequence_name) = $header =~ /^>(.*)/;