SYSU-zhanglab / RNA-m5C

m5C mapping and site calling pipeline.
MIT License

remove redundance BUG #1

Closed jz-Jiang closed 2 years ago

jz-Jiang commented 2 years ago

When I run this command:

python ~/m5c_practice/pipeline/RNA-m5C-master/0_m5C_step-by-step_metadata/anno_to_base_remove_redundance_v1.0.py \
    -i Homo_sapiens.GRCh38.106.base.sorted \
    -o Homo_sapiens.GRCh38.106.noredundance.base \
    -g Homo_sapiens.GRCh38.106.genelist

I get this error:

Traceback (most recent call last):
  File "~/m5c_practice/pipeline/RNA-m5C-master/0_m5C_step-by-step_metadata/anno_to_base_remove_redundance_v1.0.py", line 164, in <module>
    changeDict(key,gene_id,trans_id)
  File "~/m5c_practice/pipeline/RNA-m5C-master/0_m5C_step-by-step_metadata/anno_to_base_remove_redundance_v1.0.py", line 20, in changeDict
    'order':order[biotype.get(gene_id)]
KeyError: 'lncRNA'

I'm not familiar with Python and don't know where the problem is. Could anyone give me some help? Thanks in advance!

jhfoxliu commented 2 years ago

Hi,

The script labels genomic bases by RNA type. It contains some default [RNA type]:[order] pairs. If an [RNA type] in your annotation is not among the defaults, you get this error message.

You can use the --score option to fix it:

(1) Create a TXT file containing rows like:

lncRNA[a TAB here][order]

where order is any INT value. A bigger value means higher priority. For example, if a base is annotated by both CDS[prior=10] and lncRNA[prior=5], CDS will be the annotation for the base.

(2) Run the script with the option: --score [TXT_file_name]

I remember that some key:value pairs are missing for mouse. Just append key:value pairs to the score file until the message disappears.
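
A minimal sketch of how such priorities might resolve overlapping annotations, and of writing a TAB-separated score file. The names `default_order`, `load_scores`, and `pick_annotation` are hypothetical and are not taken from the script; the order values are the illustrative ones from the example above, and the real script's parsing may differ.

```python
import os
import tempfile

# Hypothetical defaults; the real script ships its own, larger dictionary.
default_order = {"CDS": 10}

def load_scores(path):
    """Read [RNA type]<TAB>[order] rows into a dict, skipping blank lines."""
    scores = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            rna_type, order_value = line.split("\t")
            scores[rna_type] = int(order_value)
    return scores

def pick_annotation(rna_types, order):
    """Return the RNA type with the highest order value."""
    return max(rna_types, key=lambda rna_type: order[rna_type])

# Write a score file with one TAB-separated pair per row.
score_path = os.path.join(tempfile.mkdtemp(), "score.txt")
with open(score_path, "w") as handle:
    handle.write("lncRNA\t5\n")

# Custom scores extend the defaults, so 'lncRNA' no longer raises KeyError.
order = dict(default_order)
order.update(load_scores(score_path))
chosen = pick_annotation(["CDS", "lncRNA"], order)
print(chosen)
```

Without the extra row, `order["lncRNA"]` would raise the same kind of `KeyError` as in the traceback, because the biotype has no entry in the dictionary.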

jhfoxliu commented 2 years ago

And did you finish the Hisat2 index build step? Most of the people just don't have enough memory for that step.

jz-Jiang commented 2 years ago

Thanks for your timely reply! I followed your instructions and added the key:value pairs to the score file, but I got another type of error when running the script with the option --score *.txt:

python ~/m5c_practice/pipeline/RNA-m5C-master/0_m5C_step-by-step_metadata/anno_to_base_remove_redundance_v1.0.py \
    -i Homo_sapiens.GRCh38.106.base.sorted \
    -o Homo_sapiens.GRCh38.106.noredundance.base \
    -g Homo_sapiens.GRCh38.106.genelist \
    --score score_for_lncRNA.txt

The error is:

Traceback (most recent call last):
  File "~/m5c_practice/pipeline/RNA-m5C-master/0_m5C_step-by-step_metadata/anno_to_base_remove_redundance_v1.0.py", line 158, in <module>
    LINE = "\t".join([chr,pos_0,pos_1,dir,gene_id,trans_id,geneName.get(gene_id),isoformName.get(trans_id),type])
TypeError: sequence item 7: expected string, NoneType found

Here is my score file:

lncRNA 2
rRNA_pseudogene 4
translated_unprocessed_pseudogene 3
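
Counting from zero, item 7 of the joined list is `isoformName.get(trans_id)`, so the TypeError means that lookup returned None for some transcript ID. A minimal sketch of that failure and of a lookup that names the missing ID instead; `isoform_name`, `safe_name`, and the IDs are hypothetical stand-ins, not the script's own code or data.

```python
# Hypothetical lookup table standing in for the script's isoformName dict.
isoform_name = {"TRANSCRIPT_A": "NAME_A"}

def safe_name(table, key):
    """Look up key, raising a readable error instead of returning None."""
    value = table.get(key)
    if value is None:
        raise KeyError("ID %r has no entry in the name table" % key)
    return value

# dict.get returns None for a missing key, and str.join rejects None.
try:
    "\t".join(["chr1", isoform_name.get("TRANSCRIPT_B")])
    join_failed = False
except TypeError:
    join_failed = True

# The stricter lookup reports which ID is missing.
try:
    safe_name(isoform_name, "TRANSCRIPT_B")
    missing_message = None
except KeyError as err:
    missing_message = err.args[0]

print(join_failed, missing_message)
```

A check like this points at the input files rather than the score file, since the name tables are presumably built from the file passed with -g.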

jz-Jiang commented 2 years ago

I have finished the Hisat2 index build step and didn't run into the memory problem. When I run this script, the %MEM is 0.8.

jhfoxliu commented 2 years ago

Another option is to modify the script in two steps:

(1) Find the lines:
order = {'3prime_overlapping_ncrna' : 1, '3prime_overlapping_ncRNA': 1, ... }

(2) Insert your key:value pairs in the same format as those already in the script.

If it takes you too many days to solve the issue, I can send you the pre-built mm10 (ensembl r102) metadata.

jhfoxliu commented 2 years ago

The peak memory usage for the mouse index build should be 120-160 GB, if I remember correctly.

Please check that the .ht2 files are OK. For mouse, the indexes should occupy 7-8 GB of disk, and no .ht2 file should be empty.
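
A minimal sketch of that check, assuming the index files sit directly in one directory. The function names are hypothetical, and the directory path is whatever was used for the index build.

```python
import glob
import os

def find_empty_indexes(index_dir):
    """Return sorted .ht2 paths in index_dir whose size is zero bytes."""
    paths = glob.glob(os.path.join(index_dir, "*.ht2"))
    return sorted(p for p in paths if os.path.getsize(p) == 0)

def total_index_bytes(index_dir):
    """Sum the sizes of all .ht2 files in index_dir."""
    paths = glob.glob(os.path.join(index_dir, "*.ht2"))
    return sum(os.path.getsize(p) for p in paths)

if __name__ == "__main__":
    index_dir = "."  # assumption: run from inside the index directory
    print("empty files:", find_empty_indexes(index_dir))
    print("total size (GB): %.1f" % (total_index_bytes(index_dir) / 1e9))
```

An empty list and a total in the expected range suggest the build finished; neither proves the index is valid.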

jz-Jiang commented 2 years ago

Sorry for the late reply. I followed your second method and ran the script, but it failed with the same error: "TypeError: sequence item 7: expected string, NoneType found". Here is part of the script: https://user-images.githubusercontent.com/79552887/170288338-fa4d4966-6a90-46b1-95bd-c16686de7fea.png

BTW, I ran this script with hg38 (ensembl 106). If you have pre-built hg38 (ensembl 106) metadata, it would be very useful to me, and it would be great if you could send it to me. Thanks a lot!

jz-Jiang commented 2 years ago

I checked the .ht2 files, and they seem fine.

ls C2T/ -hl
total 8.4G
1.7G May 23 21:07 HISAT2_C2T.1.ht2
705M May 23 21:07 HISAT2_C2T.2.ht2
12K  May 23 19:58 HISAT2_C2T.3.ht2
703M May 23 19:58 HISAT2_C2T.4.ht2
1.7G May 23 21:20 HISAT2_C2T.5.ht2
716M May 23 21:20 HISAT2_C2T.6.ht2
14M  May 23 19:58 HISAT2_C2T.7.ht2
2.7M May 23 19:58 HISAT2_C2T.8.ht2
2.9G May 23 19:57 HISAT2_C2T.fa

ls G2A/ -hl
total 8.4G
1.7G May 23 22:32 HISAT2_G2A.1.ht2
705M May 23 22:32 HISAT2_G2A.2.ht2
12K  May 23 21:21 HISAT2_G2A.3.ht2
703M May 23 21:21 HISAT2_G2A.4.ht2
1.7G May 23 22:45 HISAT2_G2A.5.ht2
716M May 23 22:45 HISAT2_G2A.6.ht2
14M  May 23 21:21 HISAT2_G2A.7.ht2
2.7M May 23 21:21 HISAT2_G2A.8.ht2
2.9G May 23 21:21 HISAT2_G2A.fa

jhfoxliu commented 2 years ago

If you have modified the code, just skip the --score option.

I just realized that this bug may come from another issue: check for an empty last line in the TXT file and delete it.

Like this:

lncRNA[\t]5[\n]
rRNA[\t]8[\n]
[\n] <--- This one triggers the bug, delete it.
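
A minimal sketch for removing such empty rows automatically instead of by hand. The function names are hypothetical, and the sample text reuses the two rows from the example above.

```python
def clean_score_lines(text):
    """Drop blank or whitespace-only rows from score-file text."""
    rows = [line for line in text.splitlines() if line.strip()]
    return "\n".join(rows) + "\n"

def clean_score_file(path):
    """Rewrite the score file at path without blank rows."""
    with open(path) as handle:
        cleaned_text = clean_score_lines(handle.read())
    with open(path, "w") as handle:
        handle.write(cleaned_text)

# The trailing empty row from the example is removed; the data rows stay.
raw = "lncRNA\t5\nrRNA\t8\n\n"
cleaned = clean_score_lines(raw)
print(repr(cleaned))
```

Each remaining row still ends with a single newline, so only the empty row is gone.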

jz-Jiang commented 2 years ago

Sorry to bother you again. I still can't resolve this problem, even after deleting the last empty line. Here is the script: [image]

I checked the metadata-generating pipeline and wonder whether the error comes from the input files. Here are the first lines of Homo_sapiens.GRCh38.106.base.sorted: [image]

Here are the first lines of Homo_sapiens.GRCh38.106.genelist: [image]

jhfoxliu commented 2 years ago

Maybe I can send you an updated script via email. My email is liujh26@mail2.sysu.edu.cn, could you send the .genelist file to that?

jz-Jiang commented 2 years ago

This error has been fixed with the help of Dr. Liu. It came from the .genelist file, which was generated from a .noheader.gtf file that was missing one transcript because the header had been removed incorrectly.
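
One way to avoid losing a record when removing the header is to drop only comment lines rather than a fixed number of leading lines. A minimal sketch, assuming the header consists of lines starting with `#`, as GTF comment lines do; the function name and the sample records are made up for illustration, and this is not the pipeline's own header-removal step.

```python
def strip_gtf_header(lines):
    """Keep every line that is not a '#' comment, wherever it appears."""
    return [line for line in lines if not line.startswith("#")]

# Made-up sample: two header lines followed by two annotation records.
sample = [
    "#!example-header-line-1\n",
    "#!example-header-line-2\n",
    '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n',
    '1\tsrc\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
records = strip_gtf_header(sample)
print(len(records))
```

Because the filter looks at the content of each line, it does not depend on how many header lines a particular release happens to have.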

ponyaaaa commented 9 months ago

Hello, I have also encountered a similar issue recently. I would be very grateful if you could share what to watch out for when removing the header.