FePhyFoFum / PyPHLAWD

Python version of PHLAWD
GNU General Public License v2.0

Problem creating seq files when running setup_clade_ap.py. #48

Open teagerv opened 2 years ago

teagerv commented 2 years ago

Question Where is the -s parameter (SEQGZFOLDER) for setup_clade_ap.py meant to point?

Issue: I'm having a problem populating the gzip directory with sequences. The .table file is fully populated from the NCBI db, but the sequences aren't being found. I'm not sure where the -s parameter is supposed to point. ~/ is where all the compressed NCBI files from phlawd_db_maker are.

snail@snailbuntu:~/PyPHLAWD/src$ python3 setup_clade_ap.py -t Architaenioglossa -b /media/snail/RED1/ncbi/inv.db -o ~/Desktop/ -s ~/ -l ~/Desktop/logfile
STARTING PYPHLAWD *。ヾ(。>v<。)ノ゙*。
MAKING TREE Architaenioglossa ٩(๑꒦ິȏ꒦ິ๑)۶
MAKING DIRS IN /home/snail/Desktop ヽ(*´∀`)ノ゙
PROBLEM CREATING /home/snail/Desktop/Architaenioglossa_75116 (´;ω;`)
POPULATING DIRS /home/snail/Desktop ヽ/❀o ل͜ o\ノ
Traceback (most recent call last):
  File "/home/snail/PyPHLAWD/src/populate_dirs_first.py", line 47, in <module>
    mfid_in(tid,DB,dirl+dirr+"/"+orig+".fas",dirl+dirr+"/"+orig+".table",gzfileloc,True,limitlist = taxalist) 
  File "/home/snail/PyPHLAWD/src/get_subset_genbank.py", line 275, in make_files_with_id_internal
    idstoseq = get_seqs_from_gz(gzfileloc,fn,files_ids[fn])
  File "/home/snail/PyPHLAWD/src/get_subset_genbank.py", line 24, in get_seqs_from_gz
    fl = gzip.open(gzdir+"/"+filename,"r")
  File "/usr/lib/python3.8/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/snail//seqs.Viviparus subpurpureus voucher USNM 1292588 histone 3 (H3) gene, partial cds.'
CREATED TEMPDIR_44273/
CLUSTERING SINGLE /home/snail/Desktop/Architaenioglossa_75116/Cyclophoroidea_75117/Megalomastomatidae_928797/Acroptychia_928777 ヽ(。´・д・)ノ
Traceback (most recent call last):
  File "/home/snail/PyPHLAWD/src/cluster_tree.py", line 38, in <module>
    tablename = [x for x in files if ".table" in x][0]
IndexError: list index out of range
PYPHLAWD DONE ヽ(^□^。)ノ
Total time (H:M:S): 0:00:00.638717 ٩(º౪º๑)۶
(⌐■_■) 

Steps taken: Followed the steps on the Install page. Built phlawd_db_maker and all dependencies without errors. Built the database with phlawd_db_maker with no errors. Followed directions on the Runs page for a clustering analysis. Python version is 3.8.10

I know Python pretty well, so if I find a fix I'll make a pull request.
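One way to narrow this down is to look at what the folder passed to -s actually contains. The traceback shows get_seqs_from_gz opening gzdir + "/" + filename, and here the "filename" is a sequence description rather than something that looks like a file on disk. The sketch below is not part of PyPHLAWD; check_seq_folder is a hypothetical helper, and it assumes the compressed NCBI files end in .gz.

```python
from pathlib import Path


def check_seq_folder(seq_dir, wanted_name=None):
    """Inspect a candidate -s folder (hypothetical helper, not PyPHLAWD code).

    Returns a tuple (gz_files, wanted_exists):
      gz_files      -- sorted names of files in seq_dir ending in .gz
                       (assumption: the compressed NCBI files use that suffix)
      wanted_exists -- True/False if wanted_name was given, else None
    """
    folder = Path(seq_dir).expanduser()
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a directory")

    gz_files = sorted(
        p.name for p in folder.iterdir() if p.is_file() and p.suffix == ".gz"
    )

    wanted_exists = None
    if wanted_name is not None:
        # Same path construction as in the traceback: folder + "/" + filename
        wanted_exists = (folder / wanted_name).is_file()

    return gz_files, wanted_exists
```

Calling check_seq_folder("~/", "<name from the error message>") would show whether any .gz files are present and whether the exact file the script asked for exists.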

hmarx commented 2 years ago

I'm having this same issue on Python 3.9.13. Have there been any updates?

teagerv commented 2 years ago

Solution: I figured it out. If you're subsetting taxa, you have to make a file with the NCBI ids that you want to include, or it won't populate with any sequences (this is described in the 'Runs' doc). Don't know why I decided that wasn't relevant last time I looked at this...

There is a helper script if you already have a file with all the names, but I just used a quick BioPython script to pull them and it's running now:

from Bio import Entrez

def main():
    Entrez.email = ""
    db_type = 'nucleotide'
    search_terms = '(Architaenioglossa[Orgn])'
    output_file = '/home/snail/Desktop/architaenioglossa_taxalist.txt'

    returned_ids = esearch(search_terms, db_type)
    make_taxalist(returned_ids, output_file)

    return

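def esearch(search_terms, db_type, retmax=10000):
    # retmax=10000 is only a placeholder (the value was left blank in the
    # original post); set it to at least the number of results you expect.
    handle = Entrez.esearch(db=db_type, term=search_terms, idtype="acc", retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    print('Search returned %s results.\n' % record["Count"])

    ids = record["IdList"]

    return ids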

def make_taxalist(ids, output):

    with open(output, 'a') as fh:

        for i in ids:
            fh.write(f'{i}\n')

    return

if __name__ == '__main__':
    main()

Just set your search terms to the subset you want, set retmax to at least the number of taxa, and put in a random email (not sure if this is required).
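Note that the script opens the output file in append mode ('a'), so running it twice against the same path writes every ID twice. The thread doesn't say whether PyPHLAWD tolerates duplicate lines in the -f file, so the following is only a cautious sketch for cleaning such a list; read_id_list and write_id_list are hypothetical helper names, and the one-ID-per-line layout is taken from the script above.

```python
def read_id_list(path):
    """Return the unique, non-empty IDs from a one-ID-per-line file.

    Order of first appearance is preserved; surrounding whitespace is stripped.
    """
    seen = set()
    ids = []
    with open(path) as fh:
        for line in fh:
            item = line.strip()
            if item and item not in seen:
                seen.add(item)
                ids.append(item)
    return ids


def write_id_list(ids, path):
    """Overwrite path with one ID per line (mode 'w', so reruns don't accumulate)."""
    with open(path, "w") as fh:
        for item in ids:
            fh.write(f"{item}\n")
```

For example, write_id_list(read_id_list(path), path) rewrites a list in place without repeats or blank lines.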

YingyingYang2019 commented 1 year ago

Hi, I have the same problem! I have provided the taxalist, but it still doesn't work. Can anyone help? Thanks! The command and output are shown here:

yang@bdchxy-PowerEdge-M630-VRTX:~$ python application/PyPHLAWD-master/src/setup_clade_ap.py -t Fagales -b /storage/phlawd_db_maker-master/DB/pln.db -s /storage/phlawd_db_maker-master/DB -o application/PyPHLAWD-master/examples/clustered/ -l application/PyPHLAWD-master/examples/clustered/ -f ncbi_sp_ids_938.txt

STARTING PYPHLAWD (⌯꒪͒ ꌂ̇ ꒪͒)
LIMITING TO TAXA IN ncbi_sp_ids_938.txt
MAKING TREE Fagales (✧ ꒪◞౪◟꒪)
MAKING DIRS IN application/PyPHLAWD-master/examples/clustered ヾ(≧∪≦)ノ〃
PROBLEM CREATING application/PyPHLAWD-master/examples/clustered/Fagales_3502 (゜´Д`゜)
POPULATING DIRS application/PyPHLAWD-master/examples/clustered ₊·◟(˶╹̆ꇴ╹̆˵)◜‧・
Traceback (most recent call last):
  File "/home/yang/application/PyPHLAWD-master/src/populate_dirs_first.py", line 47, in <module>
    mfid_in(tid,DB,dirl+dirr+"/"+orig+".fas",dirl+dirr+"/"+orig+".table",gzfileloc,True,limitlist = taxalist)
  File "/home/yang/application/PyPHLAWD-master/src/get_subset_genbank.py", line 275, in make_files_with_id_internal
    idstoseq = get_seqs_from_gz(gzfileloc,fn,files_ids[fn])
  File "/home/yang/application/PyPHLAWD-master/src/get_subset_genbank.py", line 24, in get_seqs_from_gz
    fl = gzip.open(gzdir+"/"+filename,"r")
  File "/home/yang/anaconda3/envs/python3.8/lib/python3.8/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/home/yang/anaconda3/envs/python3.8/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/storage/phlawd_db_maker-master/DB//seqs.Ticodendron incognitum chloroplast rbcL gene for ribulose-1,5-bisphosphate carboxylase large subunit, partial cds.'
CREATED TEMPDIR_69418/
CLUSTERING SINGLE application/PyPHLAWD-master/examples/clustered/Fagales_3502/Fagaceae_3503/Chrysolepis_21022 (ノ′Дヾ)
Traceback (most recent call last):
  File "/home/yang/application/PyPHLAWD-master/src/cluster_tree.py", line 38, in <module>
    tablename = [x for x in files if ".table" in x][0]
IndexError: list index out of range
PYPHLAWD DONE ٩(๑˃́ꇴ˂̀๑)۶
Total time (H:M:S): 0:00:06.033473 ◦°˚(❛‿❛)/˚°◦
(⌐■_■)

bheimbu commented 1 year ago

Hi and a happy new year,

I'm experiencing the same issue, any help would be highly appreciated?!

It would also be nice if the website (https://fephyfofum.github.io/PyPHLAWD/) could be updated as there is no more setup_clade.py (which is now called setup_clade_ap.py).

Cheers Bastian

YingyingYang2019 commented 1 year ago

Hi bheimbu! Happy new year! Regarding your question: mine works with the old version of PyPHLAWD, so if you have an old version, you could try that. The new version doesn't work well for this. Good luck!

Yingyya

bheimbu commented 1 year ago

Hi @YingyingYang2019,

you made my day, it's working with the old version (downloaded as source code from here).

Cheers Bastian

harsimranpadam commented 11 months ago

Hi. I would just like to add that I am having the same trouble. If you figure anything out, please keep me updated. I also couldn't work out how to get the genus and sequence for this. If that is possible, please let me know. Here is the command I'm having trouble with:

python3 setup_clade_ap.py -t Laurales -b /Users/administrator_ge/Desktop/pln.db -s /Users/administrator_ge/Desktop/seq -o /Users/administrator_ge/Desktop/output -l /Users/administrator_ge/Desktop/logfile.md.gz -f /Users/administrator_ge/Desktop/taxalist.txt

STARTING PYPHLAWD ٩(⚙ȏ⚙)۶
LIMITING TO TAXA IN /Users/administrator_ge/Desktop/taxalist.txt
MAKING TREE Laurales ╰(✧∇✧)╯
MAKING DIRS IN /Users/administrator_ge/Desktop/output Σ(ノ°▽°)ノ
PROBLEM CREATING /Users/administrator_ge/Desktop/output/Laurales_3432 (;へ:)
POPULATING DIRS /Users/administrator_ge/Desktop/output Σ(*ノ´>ω<。`)ノ
Traceback (most recent call last):
  File "/Users/administrator_ge/apps/PyPHLAWD/src/populate_dirs_first.py", line 47, in <module>
    mfid_in(tid,DB,dirl+dirr+"/"+orig+".fas",dirl+dirr+"/"+orig+".table",gzfileloc,True,limitlist = taxalist)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/administrator_ge/apps/PyPHLAWD/src/get_subset_genbank.py", line 275, in make_files_with_id_internal
    idstoseq = get_seqs_from_gz(gzfileloc,fn,files_ids[fn])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/administrator_ge/apps/PyPHLAWD/src/get_subset_genbank.py", line 24, in get_seqs_from_gz
    fl = gzip.open(gzdir+"/"+filename,"r")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py", line 174, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/Users/administrator_ge/Desktop/seq//seqs.Hernandia nymphaeifolia trnL-trnF intergenic spacer region and trnF gene, partial sequence; chloroplast gene for chloroplast product.'
CREATED TEMPDIR_77128/
CLUSTERING SINGLE /Users/administrator_ge/Desktop/output/Laurales_3432/Hernandiaceae_22009/Gyrocarpus_13552 (ノдヽ)
Traceback (most recent call last):
  File "/Users/administrator_ge/apps/PyPHLAWD/src/cluster_tree.py", line 38, in <module>
    tablename = [x for x in files if ".table" in x][0]
IndexError: list index out of range
PYPHLAWD DONE ୧༼✿ ͡◕ д ◕͡ ༽୨
Total time (H:M:S): 0:01:01.869942 ヽ(^o^)丿
(⌐■_■)
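Every log in this thread ends with a second traceback from cluster_tree.py line 38, where [x for x in files if ".table" in x][0] indexes an empty list. That looks like a follow-on symptom: when the populate step fails, the directory being clustered apparently has no .table file to pick up. A guarded lookup along these lines would at least make that explicit. This is a minimal sketch, not PyPHLAWD's code; find_table_file is a hypothetical helper, and sorting the matches is a choice made here for determinism.

```python
import os


def find_table_file(directory):
    """Return the path of a '.table' file in directory, or None if there is none.

    Uses the same substring test as the line in the traceback
    ('".table" in x'), but returns None instead of raising IndexError.
    Hypothetical helper for illustration only.
    """
    candidates = sorted(f for f in os.listdir(directory) if ".table" in f)
    if not candidates:
        return None
    return os.path.join(directory, candidates[0])
```

A caller could then print the directory name and skip it when the result is None, which would point back to the earlier FileNotFoundError as the real failure.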