HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License

Unexpected error message when retrieving AA UniProt sequences #114

Closed AlejandroSanchezCano closed 1 year ago

AlejandroSanchezCano commented 1 year ago

First, I built the database with: cazy_webscraper <email> --classes AA

Then I tried: cw_get_uniprot_data --families AA17 -s

And I got this output:

Built output directory: .cazy_webscraper_2023-04-25_17-38-06\uniprot_data_retrieval
Using default CAZy class synonyms
Retrieving GenBank accessions for selected CAZy classes: 0it [00:00, ?it/s]
Applying CAZy family filter(s)
Retrieving GenBank accessions for selected CAZy families:   0%|                                                                                  | 0/1 [00:00<?, ?it/s]Retrieving CAZymes for CAZy family AA17
Retrieving GenBank accessions for selected CAZy families: 100%|██████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.69it/s]
Applying no taxonomic filters
Retrieving UniProt data for 418
Retrieving data for 418 proteins
[['CCD28157.1', 'ETN25003.1', 'EGZ04327.1', 'EQC34366.1', 'ETM55527.1', 'EEY56117.1', 'ETL32367.1', 'ETL25332.1', 'ETO77111.1', 'ETK81747.1', 'ETO67284.1', 'ETL88378.1', 'ETN06075.1', 'ETI38762.1', 'ETK71917.1', 'ETO83077.1', 'CCA20830.1', 'EGZ04492.1', 'KDO27085.1', 'ETM02278.1', 'ETP30231.1', 'EQC39423.1', 'ETO81104.1', 'ETK81734.1', 'ETM00651.1', 'CCI47381.1', 'ETI42112.1', 'ETL41683.1', 'ETN20052.1', 'EEY58933.1', 'ETP25841.1', 'ETI48323.1', 'EGZ23522.1', 'ETP18130.1', 'EEY69088.1', 'ETN11075.1', 'EEY61639.1', 'EEY61638.1', 'ETK90881.1', 'ETK81744.1', 'KDO29253.1', 'ETK73643.1', 'ETN20049.1', 'ETP39574.1', 'ETO70369.1', 'ETL27076.1', 'ETK95850.1', 'ETN10519.1', 'ETK88975.1', 'EGZ05857.1', 'KDO26913.1', 'ETO83083.1', 'EGZ27273.1', 'ETI35283.1', 'ETI31514.1', 'ETL32408.1', 'UIZ22004.1', 'ETL85648.1', 'ETM00659.1', 'ETK81732.1', 'ETI41722.1', 'ETI41730.1', 'AHO49056.1', 'ETK95846.1', 'ETP53850.1', 'ETK88582.1', 'ETN10510.1', 'ETO77816.1', 'ETI41706.1', 'ETI32208.1', 'ETI50994.1', 'EGZ10739.1', 'EQC34755.1', 'ETP36460.1', 'KDO27086.1', 'ETK81751.1', 'EEY60789.1', 'EEY58927.1', 'UIZ27392.1', 'ETI41723.1', 'ETK88280.1', 'ETW01779.1', 'ETP53138.1', 'EGZ21313.1', 'CCA13926.1', 'EGZ10738.1', 'ETN25011.1', 'EQC33678.1', 'ETN20074.1', 'ETK95840.1', 'EGZ21309.1', 'ETP39568.1', 'ETL27074.1', 'ETI41715.1', 'ETI35061.1', 'ETL49218.1', 'ETL41675.1', 'ETM32606.1', 'ETN20054.1', 'ETO70332.1', 'ETM55516.1', 'ETI48679.1', 'ETN20047.1', 'KDO27114.1', 'ETN14473.1', 'ETI38761.1', 'ETO71194.1', 'ETI41714.1', 'ETM31823.1', 'EEY68485.1', 'ETM48357.1', 'ETN14460.1', 'ETN10508.1', 'ETM55515.1', 'ETP39587.1', 'ETL80314.1', 'EGZ23520.1', 'ETL85650.1', 'AIG55447.1', 'EEY58936.1', 'EGZ08731.1', 'EGZ21314.1', 'CCA17179.1', 'ETI32192.1', 'ETN10809.1', 'UIZ26027.1', 'EEY58932.1', 'EGZ27342.1', 'ETL88376.1', 'ETN24019.1', 'ETL49222.1', 'ETI56033.1', 'EQC42132.1', 'ETP53858.1', 'ETL35141.1', 'ETI56043.1', 'ETN19254.1', 'EGZ08727.1', 'ETP08653.1', 'ETP03147.1', 'ETI54329.1', 'ETK88959.1', 
'ETP53851.1', 'ETM55520.1', 'ETK95844.1', 'EQC25604.1', 'EGZ21312.1', 'ETL25321.1', 'ETO84778.1', 'ETO84785.1'], ['ETK88962.1', 'UIZ24201.1', 'AIG55787.1', 'ETO62052.1', 'EQC25608.1', 'ETM48060.1', 'ETI41729.1', 'EEY54090.1', 'ETI56037.1', 'ETP52131.1', 'ETP08632.1', 'ETN25010.1', 'EEY58944.1', 'EGZ05561.1', 'ETO70366.1', 'EEY58939.1', 'EEY68486.1', 'ETN20053.1', 'ETL35158.1', 'ETL35153.1', 'ETI56084.1', 'ETK78975.1', 'ETM42059.1', 'ETK75541.1', 'ETM32479.1', 'ETL47559.1', 'ETP02046.1', 'AIG55790.1', 'ETN11038.1', 'ETL49225.1', 'EGZ07203.1', 'ETO70367.1', 'ETP25842.1', 'UIZ28766.1', 'ETM97400.1', 'AIG55491.1', 'ETP29504.1', 'EQC24790.1', 'UIZ25173.1', 'ETP53849.1', 'ETO83080.1', 'ETL35151.1', 'ETN24018.1', 'ETP11451.1', 'KAF4046070.1', 'ETK78977.1', 'ETN20064.1', 'EGZ21311.1', 'ETL94825.1', 'ETI54331.1', 'ETN14463.1', 'ETP30207.1', 'ETK72575.1', 'EEY59753.1', 'ETK78714.1', 'ETM38805.1', 'UIZ26903.1', 'ETI42600.1', 'ETI41724.1', 'ETM41652.1', 'ETL35135.1', 'ETV73941.1', 'ETP25843.1', 'ETL35162.1', 'UIZ21835.1', 'ETL35152.1', 'EGZ23516.1', 'ETP53854.1', 'UIZ26907.1', 'EEY55873.1', 'ETI31519.1', 'ETO70368.1', 'ETV73994.1', 'ETP36689.1', 'ETO67480.1', 'UIZ27394.1', 'EGZ05551.1', 'ETM02264.1', 'ETI54320.1', 'EGZ07202.1', 'AIG56201.1', 'EQC24659.1', 'ETI38512.1', 'ETV73954.1', 'EGZ05560.1', 'ETM55519.1', 'ETO67483.1', 'ETN14469.1', 'ETM31809.1', 'ETO70335.1', 'ETO67485.1', 'EQC26776.1', 'ETL32386.1', 'ETL88833.1', 'UIZ22002.1', 'ETM00653.1', 'EGZ08733.1', 'ETL94840.1', 'EGZ17951.1', 'ETI48337.1', 'EGZ07231.1', 'ETO77414.1', 'ETM41632.1', 'ETL79229.1', 'ETN20068.1', 'ETP52112.1', 'KDO29254.1', 'EEY58934.1', 'ETN00173.1', 'ETI31960.1', 'CCI46093.1', 'ETI44371.1', 'ETL88406.1', 'ETK95841.1', 'ETL95131.1', 'ETV73947.1', 'EQC24789.1', 'ETP46069.1', 'ETM02263.1', 'EQC25603.1', 'ETK81737.1', 'ETN20056.1', 'EGZ06334.1', 'ETP11442.1', 'ETV71159.1', 'EEY58926.1', 'EQC34358.1', 'ETW03002.1', 'ETN01693.1', 'ETP29981.1', 'AIG55448.1', 'ETK82650.1', 'EGZ27343.1', 'EEY67612.1', 
'ETK81749.1', 'ETI41709.1', 'ETP39589.1', 'ETP18135.1', 'ETL78558.1', 'ETP24138.1', 'ETI48336.1', 'ETP11452.1', 'EGZ08724.1', 'ETN19515.1', 'ETI55312.1', 'ETP11455.1', 'ETI38498.1', 'ETP39588.1', 'EGZ27341.1', 'AIG56266.1'], ['ETP11448.1', 'ETP24139.1', 'UIZ26906.1', 'ETP46800.1', 'ETL49223.1', 'ETO59016.1', 'ETM41638.1', 'EGZ08732.1', 'ETI52336.1', 'ETK81748.1', 'UIZ24199.1', 'AIG55788.1', 'EGZ08725.1', 'ETM02269.1', 'EEY58931.1', 'ETL78541.1', 'ETL35539.1', 'ETN00163.1', 'ETN19290.1', 'ETP24148.1', 'AHO49057.1', 'EEY68484.1', 'ETI35285.1', 'ETK82180.1', 'ETP12314.1', 'ETM41647.1', 'ETO83056.1', 'ETO70334.1', 'EGZ05859.1', 'EGZ08736.1', 'ETP02071.1', 'ETN16020.1', 'ETO60944.1', 'ETP01315.1', 'ETO60224.1', 'EGZ08730.1', 'ETI35284.1', 'ETL27075.1', 'KDO27087.1', 'ETK78976.1', 'ETI56040.1', 'ETP18136.1', 'ETL88399.1', 'ETW02996.1', 'AIG55708.1', 'ETV73948.1', 'ETO70333.1', 'EGZ08739.1', 'ETP08651.1', 'EEY56114.1', 'EEY58928.1', 'EGZ04444.1', 'ETI56038.1', 'EEY67611.1', 'ETP33252.1', 'EEY58943.1', 'KDO29252.1', 'ETI41731.1', 'ETI38720.1', 'EGZ05556.1', 'ETI38760.1', 'ETL32154.1', 'UIZ25176.1', 'ETP11454.1', 'ETI42158.1', 'ETO63823.1', 'EGZ06395.1', 'ETK81741.1', 'EGZ08734.1', 'ETK88276.1', 'ETI31524.1', 'ETL95552.1', 'ETK71888.1', 'EGZ05562.1', 'ETW02997.1', 'EGZ08128.1', 'ETO70341.1', 'ETL94839.1', 'ETK71904.1', 'ETI56039.1', 'ETI41716.1', 'ETM48076.1', 'ETI31959.1', 'ETL80307.1', 'ETL88392.1', 'ETP11446.1', 'EEY65096.1', 'KDO30778.1', 'ETM41641.1', 'ETO70735.1', 'KDO27115.1', 'ETM32478.1', 'ETI48338.1', 'EQC42107.1', 'ETO70329.1', 'ETN20072.1', 'ETP28200.1', 'ETI54319.1', 'EGZ05552.1', 'ETN14458.1', 'ETP39581.1', 'KDO27089.1', 'ETI54333.1', 'ETO83057.1', 'ETK81755.1', 'ETP46074.1', 'EGZ27274.1', 'ETL47567.1', 'ETP25849.1', 'EGZ08747.1', 'ETM55513.1', 'ETI41700.1', 'ETP50074.1', 'EGZ06330.1', 'AIG55793.1', 'ETP25844.1', 'ETI41704.1', 'ETO84776.1']]
Batch retrieving protein data from UniProt:   0%|                                                                                                | 0/3 [00:00<?, ?it/s]WARNING [bioservices.UniProt:596]:  status is not ok with Forbidden
Batch retrieving protein data from UniProt:   0%|                                                                                                | 0/3 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\alexs\anaconda3\envs\ai\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\alexs\anaconda3\envs\ai\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\alexs\anaconda3\envs\ai\Scripts\cw_get_uniprot_data.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\alexs\anaconda3\envs\ai\lib\site-packages\cazy_webscraper\expand\uniprot\get_uniprot_data.py", line 147, in main
    downloaded_uniprot_data, all_ecs = get_uniprot_data(gbk_data_to_download, cache_dir, args)
  File "C:\Users\alexs\anaconda3\envs\ai\lib\site-packages\cazy_webscraper\expand\uniprot\get_uniprot_data.py", line 348, in get_uniprot_data
    uniprot_df = UniProt().get_df(entries=query, limit=args.uniprot_batch_size)
  File "C:\Users\alexs\anaconda3\envs\ai\lib\site-packages\bioservices\uniprot.py", line 851, in get_df
    res = self.search(
  File "C:\Users\alexs\anaconda3\envs\ai\lib\site-packages\bioservices\uniprot.py", line 744, in search
    batch = batch.split("\n")[1:]
AttributeError: 'int' object has no attribute 'split'

HobnobMancer commented 1 year ago

Hi! Thanks for using cazy_webscraper.

Looking at the error message, this is a duplicate of issue #111:

  File "C:\Users\alexs\anaconda3\envs\ai\lib\site-packages\bioservices\uniprot.py", line 744, in search
    batch = batch.split("\n")[1:]
AttributeError: 'int' object has no attribute 'split'

The error is raised inside bioservices, not cazy_webscraper. Make sure you are using the latest version of bioservices.
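
For illustration, the failure mode in the traceback can be reproduced in isolation. The sketch below assumes that, when the request is rejected (the log shows a "Forbidden" warning just before the traceback), the value that reaches `batch.split("\n")` is an integer such as an HTTP status code rather than the response text. The helper `split_tsv_batch` is hypothetical and is not part of bioservices or cazy_webscraper; it only shows how a type check would turn the confusing `AttributeError` into a clearer message.

```python
def split_tsv_batch(batch):
    """Return the data rows of a TSV response, skipping the header line.

    Hypothetical helper for illustration only; it is not part of
    bioservices or cazy_webscraper.
    """
    # Assumption: a failed request yields an int (such as an HTTP status
    # code) instead of the response text, so calling .split() on it fails.
    if not isinstance(batch, str):
        raise RuntimeError(
            f"UniProt request failed; got {batch!r} instead of response text"
        )
    return batch.split("\n")[1:]


# A successful response is a string with a header line followed by rows.
print(split_tsv_batch("Entry\tLength\nP12345\t430"))

# A failed response (assumed here to be the integer 403) is reported clearly.
try:
    split_tsv_batch(403)
except RuntimeError as err:
    print(err)
```

A guard like this does not fix the rejected request itself; it only makes the cause easier to read than `'int' object has no attribute 'split'`.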

Side note: we are currently altering cazy_webscraper so that it no longer uses the bioservices.UniProt().get_df() method, and we are migrating to bioservices.UniProt().mapping(), which will be faster and reduce the burden on the UniProt REST API (and will also mean cazy_webscraper won't be using the section of bioservices code that keeps causing issues). Progress is over on PR #115. We've both been busy at the moment, so progress is slower than expected. We still need to run a few more test runs and update the unit tests prior to releasing the update.

AlejandroSanchezCano commented 1 year ago

Hello! Thank you for the quick response. I am using version 1.11.2, which is the latest version of bioservices.
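
One way to confirm which bioservices version is installed in the environment that actually runs cw_get_uniprot_data is to query the package metadata from Python. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The package is not installed in the active environment.
        return None


# Run this with the same interpreter (for example, the same conda
# environment) that is used to launch cw_get_uniprot_data.
print(installed_version("bioservices"))
```

Running it inside the same conda environment matters because a different environment may hold a different bioservices release.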