TaliaferroLab / LABRAT

Lightweight Alignment Based Resolution of Alternative Three Prime Ends
MIT License
8 stars 4 forks source link

LABRATsc - [err]gffutils.exceptions.FeatureNotFoundError #9

Closed lancy-liang closed 1 year ago

lancy-liang commented 1 year ago

Hi,

When I tried to use LABRATsc, I used the following command:

LABRATsc.py --mode cellbycell --gff $gff_path/Sus_scrofa.Sscrofa11.1.109.chr.gff3 --alevindir $alevindir_path --conditions $condition_path/tissue1_tissue2.tsv --readcountfilter 5 --conditionA tissue1 --conditionB tissue2

but err, (labrat) [name@login1 script]$ cat 8468137.err /public/home/name/miniconda3/envs/labrat/lib/python3.6/site-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm Traceback (most recent call last): File "/public/home/name/miniconda3/envs/labrat/bin/LABRATsc.py", line 816, in posfactors = getpositionfactors(args.gff, 25) File "/public/home/name/miniconda3/envs/labrat/bin/LABRATsc.py", line 755, in getpositionfactors g = db[gene] File "/public/home/name/miniconda3/envs/labrat/lib/python3.6/site-packages/gffutils/interface.py", line 227, in getitem raise FeatureNotFoundError(key) gffutils.exceptions.FeatureNotFoundError: ENSSSCG00000004016

(labrat) [name@login1 script]$ cat 8468137.out Collecting count data... Reading data in /public/home/name/workspace/scRNA_APA/labratsc/output/tissue/alevin_out/tissue2... Done reading 500 cells Read total 545 cells Found total 1249793 reads Reading data in /public/home/name/workspace/scRNA_APA/labratsc/output/tissue/alevin_out/tissue1... Done reading 500 cells Read total 545 cells Found total 1249793 reads Adding condition/cluster info... Assigning transcripts to polyA sites... Indexing gff... Done indexing!

lancy-liang commented 1 year ago

grep "ENSSSCG00000004016" Sus_scrofa.Sscrofa11.1.109.chr.gff3 1 ensembl gene 1527441 1550812 . - . ID=gene:ENSSSCG00000004016;biotype=protein_coding;gene_id=ENSSSCG00000004016;logic_name=ensembl;version=5 1 ensembl mRNA 1527441 1537814 . - . ID=transcript:ENSSSCT00000004441;Parent=gene:ENSSSCG00000004016;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=ENSSSCT00000004441;version=5 1 ensembl mRNA 1527441 1550812 . - . ID=transcript:ENSSSCT00000104344;Parent=gene:ENSSSCG00000004016;biotype=protein_coding;transcript_id=ENSSSCT00000104344;version=1 1 ensembl mRNA 1527609 1537814 . - . ID=transcript:ENSSSCT00000095810;Parent=gene:ENSSSCG00000004016;biotype=protein_coding;transcript_id=ENSSSCT00000095810;version=1

taliaferrojm commented 1 year ago

This is a gffutils exception. There is an issue with your database. Your gff file contains feature(s) that are not in the database. Either you are using a gff file and database that do not match, or you stopped database creation before it finished. Note that the second option is easy to do as if you kill the script before database creation is finished, a .db file will still be made. When LABRAT looks for a .db file, it will find that one and use it.

I would advise to delete that .db file and start fresh. Note that while database creation make take a while, you only have to do it once.

lancy-liang commented 1 year ago

Hi,I have deleted the .db file and tried running it again, but I still encounter the same error: raise FeatureNotFoundError(key) gffutils.exceptions.FeatureNotFoundError: ENSSSCG00000004016.

taliaferrojm commented 1 year ago

OK. This is a difficulty with dealing with the wide variety of formatting when it comes to gff attributes. Try this. At line 755, replace

g = db[gene]

with

g = db['gene:' + gene]

If that changes and/or fixes things, I can update LABRAT with some more sophisticated code at that location.

lancy-liang commented 1 year ago

Dear taliaferrojm, I have made the change as you suggested, but I still encounter an error:

/public/home/name/miniconda3/envs/labrat/lib/python3.6/site-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm Traceback (most recent call last): File "/public/home/name/miniconda3/envs/labrat/bin/LABRATsc.py", line 816, in posfactors = getpositionfactors(args.gff, 25) File "/public/home/name/miniconda3/envs/labrat/bin/LABRATsc.py", line 761, in getpositionfactors t1 = db[txpf1] File "/public/home/name/miniconda3/envs/labrat/lib/python3.6/site-packages/gffutils/interface.py", line 227, in getitem raise FeatureNotFoundError(key) gffutils.exceptions.FeatureNotFoundError: ENSSSCT00000095810

However, the gene ENSSSCT00000095810 is different from the previously reported error.

Many thanks!

taliaferrojm commented 1 year ago

OK try this. At line 758, replace

txpf1 = tx

with

txpf1 = 'transcript:' + tx

and at line 760, replace

txpf2 = tx

with

txpf2 = 'transcript:' + tx

lancy-liang commented 1 year ago

Dear taliaferrojm, I have made the change as you suggested, but I still encounter an error: Regarding the error message: I have a single-cell 10X 3' end BAM file (generated by cellranger), and I encountered the following error when using countfrombam.py (base) [usersname@login1 script]$ cat 8946612.err 2023-09-01 18:40:48,515 - INFO - Populating features 2023-09-01 21:57:24,505 - INFO - Populating features table and first-order relations: 1530811 features 2023-09-01 21:57:24,505 - INFO - Updating relations 2023-09-01 21:57:43,087 - INFO - Creating relations(parent) index 2023-09-01 21:57:44,960 - INFO - Creating relations(child) index 2023-09-01 21:57:47,140 - INFO - Creating features(featuretype) index 2023-09-01 21:57:48,671 - INFO - Creating features (seqid, start, end) index 2023-09-01 21:57:50,372 - INFO - Creating features (seqid, start, end, strand) index 2023-09-01 21:57:52,174 - INFO - Running ANALYSE features Traceback (most recent call last): File "/./workspace/scRNA_APA/labrat/script/countfrombam.py", line 356, in positionfactorintervals = getpositionfactorintervals(args.gff, posfactors) File "/./workspace/scRNA_APA/labrat/script/countfrombam.py", line 254, in getpositionfactorintervals tx = db[txid] File "/./miniconda3/envs/labrat/lib/python3.6/site-packages/gffutils/interface.py", line 227, in getitem raise FeatureNotFoundError(key) gffutils.exceptions.FeatureNotFoundError: ENSSSCT00000042748

Regarding the output issues:

(1) Only one file is generated: numberofposfactors.txt, and it is saved in the same path as the submitted script. If I use slurm array to submit the script, only one numberofposfactors.txt file is generated, and the others are overwritten. (2) I didn't obtain any output related to calculating PSI. (3) I read the article "Diverse cell-specific patterns of alternative polyadenylation in Drosophila" and it mentioned putting the BAM file into labrat. However, when using countfrombam.py, I didn't get the classification for ALE and TUTR. Which script should I use for a 10X 3' end BAM file? Many thanks!

taliaferrojm commented 1 year ago

You are likely going to have to make the same changes to countfrombam.py that you made to LABRATsc.py. The issues is likely the same, i.e. that the annotation you are using formats the "gene:" and "transcript:" attributes slightly differently than the GENCODE annotations LABRAT was created with.

lancy-liang commented 1 year ago

I have made the changes as you previously instructed, but I am still getting an error.

taliaferrojm commented 1 year ago

I need more info to help, like the error you are getting now. Did you make the changes described both here and here?

lancy-liang commented 1 year ago

this is the modified countfrombam.py file as per your instructions. (See attachment)(I have sent this script countfrombam.py to your email)

The gff file I used is :https://ftp.ensembl.org/pub/release-104/gff3/sus_scrofa/Sus_scrofa.Sscrofa11.1.104.chr.gff3.gz

I have made the change as you suggested(replace g = db[gene] with g = db['gene:' + gene], and txpf1 = 'transcript:' + tx,txpf2 = 'transcript:' + tx) but I still encounter an error:

I have a single-cell 10X 3' end BAM file (generated by cellranger), and I encountered the following error when using countfrombam.py

(base) @.*** script]$ cat 8946612.err 2023-09-01 18:40:48,515 - INFO - Populating features 2023-09-01 21:57:24,505 - INFO - Populating features table and first-order relations: 1530811 features 2023-09-01 21:57:24,505 - INFO - Updating relations 2023-09-01 21:57:43,087 - INFO - Creating relations(parent) index 2023-09-01 21:57:44,960 - INFO - Creating relations(child) index 2023-09-01 21:57:47,140 - INFO - Creating features(featuretype) index 2023-09-01 21:57:48,671 - INFO - Creating features (seqid, start, end) index 2023-09-01 21:57:50,372 - INFO - Creating features (seqid, start, end, strand) index 2023-09-01 21:57:52,174 - INFO - Running ANALYSE features Traceback (most recent call last): File "/./workspace/scRNA_APA/labrat/script/countfrombam.py", line 356, in positionfactorintervals = getpositionfactorintervals(args.gff, posfactors) File "/./workspace/scRNA_APA/labrat/script/countfrombam.py", line 254, in getpositionfactorintervals tx = db[txid] File "/./miniconda3/envs/labrat/lib/python3.6/site-packages/gffutils/interface.py", line 227, in getitem raise FeatureNotFoundError(key) gffutils.exceptions.FeatureNotFoundError: ENSSSCT00000042748

------------------ 原始邮件 ------------------ 发件人: "TaliaferroLab/LABRAT" @.>; 发送时间: 2023年9月8日(星期五) 晚上10:55 @.>; @.**@.>; 主题: Re: [TaliaferroLab/LABRAT] LABRATsc - [err]gffutils.exceptions.FeatureNotFoundError (Issue #9)

I need more info to help, like the error you are getting now. Did you make the changes described both here and here?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

lancy-liang commented 1 year ago

I made it! thank you very much !!!

taliaferrojm commented 1 year ago

Glad you were able to figure it out!