JiekaiLab / scTE

MIT License
97 stars 27 forks source link

Issues building an index for a custom genome (Turquoise killifish) #22

Open BenayounLaboratory opened 3 years ago

BenayounLaboratory commented 3 years ago

Hi - thank you for creating such a unique/useful tool! After testing the pipeline on one of the "out of the box" supported genomes (mm10) successfully, we are trying to use it to measure TE expression in single cells for the African turquoise killifish. We are using the NCBI genome annotation files (gtf) and ran repeatmasker to get the TE bed file. Everything looks ok based on the sample files provided here on Github. We have the most recent version of the package, and we use this command to build the index:

scTE_build -te 2015_Genome_scTE.bed -gene GCF_001465895.1_Nfu_20140520_genomic_CLEAN_MT_exon.filtered.gtf -o Nfu_20140520 -g other

Although the run starts ok, we get a warning almost immediately that does not stop the run:

/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build:110: DeprecationWarning: 'U' mode is deprecated
  o = open(tefile,'rU')

However, after a few minutes, the program aborts with the following error:

Traceback (most recent call last):
  File "/usr/local/bin/scTE_build", line 4, in <module>
    __import__('pkg_resources').run_script('scTE==1.0', 'scTE_build')
  File "/usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1448, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build", line 468, in <module>
    main()
  File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build", line 461, in main
    genomeIndex(args.genome,args.mode,tefile,genefile, args.out,'No path','No path')
  File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build", line 127, in genomeIndex
    gls.load_list(clean)
  File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/scTE/miniglbase/genelist.py", line 1472, in load_list
    list_to_load[0]
IndexError: list index out of range

Would you be able to assist in figuring out what the problem is? We are very excited to use the package on the African turquoise killifish data but won't be able to until we can have an index... Thank you so much for your help in advance!

jphe commented 3 years ago

Can you check the 2015_Genome_scTE.bed file is 6 columns bed format?

BenayounLaboratory commented 3 years ago

Hi,

yes. This is what the beginning of the file looks like:

NC_029649.1 0   910 NotFur1_Unknown_553 0   -
NC_029649.1 908 1184    NotFur1_rnd-1_family-367    0   +
NC_029649.1 1175    1836    NotFur1_rnd-1_family-330    0   +
NC_029649.1 1836    1878    NotFur1_Unknown_695 0   -
NC_029649.1 1925    2016    NotFur1_Unknown_695 0   -
NC_029649.1 2017    2142    NotFur1_Unknown_176 0   -
NC_029649.1 2137    2341    NotFur1_GapFilledScaffold_7542_21782    0   -
NC_029649.1 2396    2784    NotFur1_GapFilledScaffold_11762_4388    0   -
NC_029649.1 3549    3722    NotFur1_rnd-1_family-25 0   -
NC_029649.1 3650    3731    NotFur1_LTR_154 0   -
NC_029649.1 3808    3858    NotFur1_GapFilledScaffold_5446_39464    0   +
NC_029649.1 4004    4101    NotFur1_rnd-1_family-22 0   +
NC_029649.1 4018    4124    NotFur1_GapFilledScaffold_5294_191  0   +
huanglu2018 commented 3 years ago

I encountered the same error today, after checking the code in scTE_build, I found the cause of the error:

  1. The readGtf() function in scTE_build script only expects regular chromosome name (1, 2, 3, etc) from .gtf files chr_list = [ str(k) for k in list(range(1,50))] + ['X','Y', 'M'] #line14

  2. readGtf() expects 'gene_biotype' information offered in the last column of .gtf files, which it can extract exons from protein_coding and lincRNA genes (useful in exclusive mode)

hope my experience be helpful

BenayounLaboratory commented 3 years ago

Thank you huangly2018! We have the biotype there, so that's probably not creating the issue. Developers: would there be a way to modify the code so as to be compatible with model organisms that have non-standard chromosomes names?

jphe commented 3 years ago

Yes, just as huangly2018 pointed out, scTE limited the chromosome names to avoid any unwanted bugs.

you can replace the line14 of scTE_buld chr_list = [ str(k) for k in list(range(1,50))] + ['X','Y', 'M'] with the list of your non-standard chromosome names chr_list =['NC_029649.1',...], then it should be OK.

BenayounLaboratory commented 3 years ago

Hi Jphe thanks for answering! Unfortunately, this is a draft genome with > 10,000 scaffolds, so the manual listing is just not practical. Would there be a way for the program to read the list of scaffolds from a file, or maybe deduce it from the gtf gene annotation file? If adding 1-2 lines of code can do that, could you please assist? I am not proficient with python. Thank you for your assistance - again, many people using nonstandard model organisms would be able to use the program if this is possible to do... Thank you!

jphe commented 3 years ago

Can you send the TE bed file and gene gtf file to me, then I can try and fix it in my PC

BenayounLaboratory commented 3 years ago

The 2 files are available there: https://www.dropbox.com/sh/50iccy9qp3a1hyp/AAC-TLmQ7buWeJitJRwHb2dCa?dl=0

I think this would help all people working on non-perfect genome assemblies - thank you so much!

BenayounLaboratory commented 3 years ago

Hey - sorry to bother you. Any updates on using a genome without the standard chromosome names? Thank you so much!

jphe commented 3 years ago

Can you try to replace the scTE/bin/scTE_build script with the attached file, it works in my PC.

You need to change the line16 with the absolute path of your bed/gtf file. image

image

scTE_build.zip

BenayounLaboratory commented 3 years ago

Thank you!

Sorry to bug you again - when I replace the file in my install as you suggested, I get this error:

Traceback (most recent call last):
  File "/usr/local/bin/scTE_build", line 12, in <module>
    from scTE.miniglbase import genelist, glload, location

Any idea why?

Thank you!

jphe commented 2 years ago

It's wired, may I make sure that you replace the old scTE_build.py file with new one, and then reinstall scTE and then get that error.

And I can not see the the full error information that you get.

xiangyupan commented 2 years ago

Hi @jphe , Thanks for the excellent work. When I use the scTE_build to build the index for Axolotl genome, the same error occurred with BenayounLaboratory who mentioned before. I followed the motified scTE_build.zip you uploaded and verified the gtf has the same chromsome names within the TE bed file with 6 coloumns. The command and error are just as follow. python3 scTE_build -te Axolotl.scTE.10000.bed -gene AmexT_v47-AmexG_v6.0-DD.gtf -o custom
INFO : genelist(): loaded 'Axolotl.scTE.bed' found 28728752 items INFO : genelist(): loaded 'AmexT_v47-AmexG_v6.0-DD.gtf' found 1741461 items Namespace(tefile=['Axolotl.scTE.bed'], genefile=['AmexT_v47-AmexG_v6.0-DD.gtf'], mode='exclusive', out='custom', genome='other', info=<function info at 0x2ad49cdc83a0>) INFO : Building the scTE genome annotation index... 2022-08-16 22:22:04 /public1/home/sc60481/software/scTE/scTE_build:114: DeprecationWarning: 'U' mode is deprecated o = open(tefile,'rU') Traceback (most recent call last): File "/public1/home/sc60481/software/scTE/scTE_build", line 472, in <module> main() File "/public1/home/sc60481/software/scTE/scTE_build", line 465, in main genomeIndex(args.genome,args.mode,tefile,genefile, args.out,'No path','No path') File "/public1/home/sc60481/software/scTE/scTE_build", line 131, in genomeIndex gls.load_list(clean) File "/public1/home/sc60481/software/scTE/scTE/miniglbase/genelist.py", line 1472, in load_list list_to_load[0] IndexError: list index out of range I feel confused about how to solve this problem. Could you help me with this issue? For your convenient, I upload the two files for you to test. The TE file is huge in axolotl, then I intercept 10000 lines for test. https://cowtransfer.com/s/5f6927ec36fa46 The password is 0paaqv. Thanks for your time and work.