Using gtf instead of gff

Kyung-TaeLee commented 9 months ago

Hi thank you for great tool

I am planning to run Labrartsc using Drosophila dropseq data. I have gene annotation in gtf (from flybase) format with some newly added gene models from transcriptome assembly. Can I use this for the analysis?

taliaferrojm commented 9 months ago

Hi Kyung-TaeLee,

LABRAT is expecting a gff file, not gtf. You can add what ever gene models you like, but it does need to be in gff format. GFF files from different places have different attributes and tags, some of which LABRAT is expecting. The safest route would be to use an annotation from Ensembl (with your new gene models added) as LABRAT is currently set up to play well with annotations from Ensembl.

Kyung-TaeLee commented 9 months ago

Thank you for the quick response.

Following your suggestion, I converted my gtf file into gff file and made sure that "gene_type" and "transcript_type" is included in the attribute for the successful running. For the test run of my single-cell RNA-seq (Drop-seq), I first ran salmon indexing as follows:

command line: ''' python /home/kyungtae/programs/yes/envs/labrat/bin/LABRAT.py --mode makeTFfasta --gff /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/result/labrat/longread_support_anno/fly.corrected.gff --genomefasta /home/kyungtae/reference/fly/BDGP6.22/genome/Drosophila_melanogaster.BDGP6.22.manual.fasta --lasttwoexons --librarytype 3pseq --threads 30 '''

It ran successfully and and .db file and fasta file were generated in the same dirctory that contains gff file

Next, I ran salmon alevin for the quantification but got error:

command line: ''' /home/kyungtae/programs/yes/envs/labrat/bin/salmon alevin -l ISR -1 /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/rawdata/BDGP6.22.APA.novelRNA_2023_11_05_CG14105_altEnd_removed/120AEL_WT_lymphgland/DropSeq_Normal_LG_120hAEL_lib1/DropSeq_120hAEL_lib1_R1.fastq -2 /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/rawdata/BDGP6.22.APA.novelRNA_2023_11_05_CG14105_altEnd_removed/120AEL_WT_lymphgland/DropSeq_Normal_LG_120hAEL_lib1/DropSeq_120hAEL_lib1_R2.fastq --dropseq -i /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/result/labrat/longread_support_anno/TFseqs.fasta -p 40 -o /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/result/labrat/test/120WT_lymph1/ --tgMap /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/result/labrat/longread_support_anno/TFseqs.fasta.tgMap --fldMean 250 --fldSD 20 --validateMappings --whitelist /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/result/sierra/white_list/white_list_dir/DropSeq_Normal_LG_120hAEL_lib1.whitelist.txt '''

error message: ''' [2024-01-26 16:25:06.528] [alevinLog] [error] Index Directory or the txpInfo.bin: /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/result/labrat/longread_support_anno/TFseqs.fasta/txpInfo.bin doesn't exist. ''' When I ran indexing step using labrat.py, it did not generate "txpIfno.bin" directory but the salmon alevin seemed to require it for the run. What steps did I miss or is there anything that I've done wrong?

Plust, my whiltelist file conatins CB of interest per line without header. Is this file format appropriate for the whiltelist?

Thank you for your help and have a great day

taliaferrojm commented 9 months ago

For LABRAT with bulk data, the runSalmon mode will automatically make the salmon index. However, for single cell data, LABRAT directly takes in alevin output that you made yourself. Note that this uses LABRATsc.py (or maybe LABRATsc_dm6.py) and not LABRAT.py. So you will need to manually quantify your samples using alvein. The good news is that you have already made the fasta file for making the alevin (salmon) index.

If it helps, LABRATsc was used to quantify APA in Drosophila single cell data in this paper. Let me know if you have more questions.

Kyung-TaeLee commented 9 months ago

Based on your suggestion, can I assume that following steps should be taken?

first, I should re-run "python /home/kyungtae/programs/yes/envs/labrat/bin/LABRAT.py --mode makeTFfasta ~" using LABRATsc.py or LABRATsc_dm6.py

second, after making TFfasta, I should make salmon alevin index using TFfasta file and then run salmon alevin myself.

third, I proceed with calculating psi using LABRATsc.py or LABRATsc_dm6.py usinf the alevin output

Let me know if I understood your suggestions correctly.

Thank you so much for your responses!

taliaferrojm commented 9 months ago

Yup, I think you got it. I'm pretty sure the TFfasta produced by LABRAT.py would be the same as the one produced by LABRATsc.py, but it also wouldn't hurt to make it again.

Because every gff file seems to be slightly different (maddeningly so), there might be another hiccup or two in the calculating psi part, but it should be fixable. Let me know if you run into more problems.

taliaferrojm commented 9 months ago

Oh, also, this paper has step by step instructions on using LABRAT with single cell data. It could be useful.

Kyung-TaeLee commented 9 months ago

Dear Dr.Taliaferro

Thank you so much for your help. I ran the salmon indexing myself and successfully finished the salmon alevin quantification step. After that, I ran LABRATsc.py by the following commandline.

''' /home/kyungtae/programs/yes/envs/labrat/bin/python /home/kyungtae/programs/yes/envs/labrat/bin/LABRATsc.py --mode cellbycell --gff /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/result/labrat/longread_support_anno/fly.corrected.gff --alevindir /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/result/labrat/alevin_output/ --readcountfilter 5 --conditions /home/kyungtae/dataset/fly_waspInfest_lncRNA/singlecell/result/labrat/cellid_conditions.txt '''

First, "[err]gffutils.exceptions.FeatureNotFoundError" occured and found that someone already encountered the same problem and asked the questions at this github repo. I changed the code (both LABART.py and LABARYsc.py) as you suggested in the issue and ran the whole step beginning from makeTFFasta again. (https://github.com/TaliaferroLab/LABRAT/issues/9)

Now I got different error: File "/home/kyungtae/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc raise KeyError(key) from err KeyError: 'scf-BIGFLY-AltEnd-2'

I added novel transcripts to the reference gene model so that some of the IDs contain hyphen in the IDs. I also disabled the line of the original code that split gene ID by "." since some of the gene IDs contain dots. I think some of these IDs are not recognized by the script. Can you tell how to fix this problem? If you need to look into the GFF file I am using, I will send it you by e-mail.

Thank you again and hope to hear from you soon

Best regards, Kyung-Tae Lee

taliaferrojm commented 9 months ago

Can you send me the full error message for the KeyError: 'scf-BIGFLY-AltEnd-2'? Can you go ahead and send me your gff too?

Kyung-TaeLee commented 9 months ago

Here is the full error message.

''' Traceback (most recent call last): File "/home/kyungtae/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'scf-BIGFLY-AltEnd-2'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/kyungtae/programs/yes/envs/labrat/bin/LABRATsc.py", line 825, in psidf = calculatepsi_cellbycell(bigdf, posfactors, args.readcountfilter) File "/home/kyungtae/programs/yes/envs/labrat/bin/LABRATsc.py", line 413, in calculatepsi_cellbycell counts = row[txid] File "/home/kyungtae/.local/lib/python3.6/site-packages/pandas/core/series.py", line 882, in getitem return self._get_value(key) File "/home/kyungtae/.local/lib/python3.6/site-packages/pandas/core/series.py", line 990, in _get_value loc = self.index.get_loc(label) File "/home/kyungtae/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc raise KeyError(key) from err KeyError: 'scf-BIGFLY-AltEnd-2'

'''

I also sent you the GFF file to your gmail account. Thank you!

taliaferrojm commented 9 months ago

OK I think we can get to the bottom of this. Can you try a few things for me?

Do you have a version of this GFF without your new features? From what I can see, maybe even just grep -v BIGLAB and then run LABRATsc on that (starting with makeTFfasta)? If we can narrow down the problem to being associated with the new features that would help. Keep in mind that the indexing step to make the db for the gff first checks to see if there is a .db file in the same directory as the gff. So if you want to make a new db of a gff you have already run, you need to delete that db file first.
If you look in the fasta produced by makeTFfasta that you have now (meaning the one you used to make the salmon/alevin index), does it have a sequence for scf-BIGFLY-AltEnd2? Do you end up with counts assigned to it in the alevin output? My guess right now is that for some reason that transcript is not being quantified by alevin.

Kyung-TaeLee commented 9 months ago

First I tried running the LABART based on your first suggestions. I made GFF that does not contain IDs with "BIGLAB" letters. However, similar error occured as follows ''' Traceback (most recent call last): File "/home/kyungtae/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'FBtr0072677'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/kyungtae/programs/yes/envs/labrat/bin/LABRATsc.py", line 825, in psidf = calculatepsi_cellbycell(bigdf, posfactors, args.readcountfilter) File "/home/kyungtae/programs/yes/envs/labrat/bin/LABRATsc.py", line 413, in calculatepsi_cellbycell counts = row[txid] File "/home/kyungtae/.local/lib/python3.6/site-packages/pandas/core/series.py", line 882, in getitem return self._get_value(key) File "/home/kyungtae/.local/lib/python3.6/site-packages/pandas/core/series.py", line 990, in _get_value loc = self.index.get_loc(label) File "/home/kyungtae/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc raise KeyError(key) from err KeyError: 'FBtr0072677' '''.

Then I looked into "quants_mat_cols.txt" in the alevin output directory to find out what features were quantified. All the features were gene symbols, not the transcripts. Maybe I am doing something wrong so that alevin quantification was done in gene level and this leads to descrepancy with LABARYsc trying to calculate PSI based on expression levels in transcripts?

and I also checked the makeTFfasta output and "scf-BIGFLY-AltEnd2" was included in the fasta file.

taliaferrojm commented 8 months ago

Are you supplying a --tgMap file to LABRATsc?

Kyung-TaeLee commented 8 months ago

You mean in the salmon alevin step? Yes I included tgMap option in the salmon alevin quantification step. tgMap file consist of transcript IDs in the 1st column and gene symybols in the 2nd column.

Kyung-TaeLee commented 8 months ago

I managed to run the whole step successfully and am now looking into the results. Those who want to run labratsc using custom gtf or gff, you can follow through questions and comments discussed in this post. In addition, I will suggest a few more things to consider when analzying custom gene annotation.

If you have dot included in your gene id, please remove it or fix the source code (labrart~~.py) where it identifies gene Id.
Some of gene annotation include genes which have different identifiers but are located in exactly the same genomic cooridinates (for example, even reference gene model from model organism such as fly include those cases). If that is your case of analysis, include --keepDuplicates option when running salmon index.

and thank Dr.Taliaferro very much for kindly and prompt reponses.

TaliaferroLab / LABRAT

Using gtf instead of gff #11