THUNLP-MT / dyMEAN

This repo contains the codes for our paper "End-to-End Full-Atom Antibody Design"
https://arxiv.org/abs/2302.00203
MIT License

How to deal with the problem when starting data preprocess? #4

Open liusfore opened 1 year ago

liusfore commented 1 year ago

(dyMEAN) dell@dell-Precision-7920-Tower:/mnt/e/code/dyMEAN$ bash scripts/data_preprocess.sh all_structures/imgt all_data
Locate the project folder at /mnt/e/code/dyMEAN
Processing SAbDab with output directory /mnt/e/code/dyMEAN/all_data
Processing RAbD with output directory /mnt/e/code/dyMEAN/all_data/RAbD
2023-06-15 15:59:18::INFO::Namespace(fout='/mnt/e/code/dyMEAN/all_data/rabd_all.json', n_cpu=4, numbering='imgt', pdb_dir='/mnt/e/code/dyMEAN/all_structures/imgt', pre_numbered=True, summary='/mnt/e/code/dyMEAN/all_data/sabdab_all.json', type='rabd')
2023-06-15 15:59:18::INFO::download rabd from summary file /mnt/e/code/dyMEAN/all_data/sabdab_all.json
2023-06-15 15:59:18::INFO::Extracting summary to json format
Traceback (most recent call last):
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/e/code/dyMEAN/data/download.py", line 376, in <module>
    main(parse())
  File "/mnt/e/code/dyMEAN/data/download.py", line 360, in main
    items = read_rabd(fpath)
  File "/mnt/e/code/dyMEAN/data/download.py", line 94, in read_rabd
    with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/sabdab_all.json'
Traceback (most recent call last):
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/e/code/dyMEAN/data/split.py", line 249, in <module>
    main(parse())
  File "/mnt/e/code/dyMEAN/data/split.py", line 72, in main
    items = load_file(args.data)
  File "/mnt/e/code/dyMEAN/data/split.py", line 37, in load_file
    with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/sabdab_all.json'

kxz18 commented 1 year ago

Hi~ Sorry for the mistake in data_preprocess.sh. I accidentally commented out the processing logic for SAbDab, which is why sabdab_all.json was never generated. I've now uncommented it. Could you please run the script again to see if the problem is solved? I think it should be fine now.
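
As an illustration only (this is not code from the repository), a small pre-flight check along these lines would make this kind of failure easier to spot: it stops with a readable message when a file that a later step depends on, such as sabdab_all.json, has not been produced. The helper names `missing_files` and `require_files` are hypothetical.

```python
import sys
from pathlib import Path


def missing_files(paths):
    """Return the paths (as strings) that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


def require_files(paths, step_name):
    """Exit with a readable message if a prerequisite of `step_name` is absent."""
    missing = missing_files(paths)
    if missing:
        sys.exit(f"Cannot run {step_name}: missing {', '.join(missing)}")


if __name__ == "__main__":
    # Hypothetical usage: pass the files the next step needs on the command line,
    # e.g. python check_inputs.py all_data/sabdab_all.json
    require_files(sys.argv[1:], "RAbD processing")
```

Running such a check before the RAbD and split steps would report the missing SAbDab output once, instead of repeating a FileNotFoundError traceback in every later step.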

liusfore commented 1 year ago

It seems that the bug still occurs. The file sabdab_all.json has not been generated.

(dyMEAN) dell@dell-Precision-7920-Tower:/mnt/e/code/dyMEAN$ bash scripts/data_preprocess.sh all_structures/imgt all_data
Locate the project folder at /mnt/e/code/dyMEAN
Processing SAbDab with output directory /mnt/e/code/dyMEAN/all_data
2023-06-16 11:17:59::INFO::Namespace(fout='/mnt/e/code/dyMEAN/all_data/sabdab_all.json', n_cpu=4, numbering='imgt', pdb_dir='/mnt/e/code/dyMEAN/all_structures/imgt', pre_numbered=True, summary='summaries/sabdab_summary.tsv', type='sabdab')
2023-06-16 11:17:59::INFO::download sabdab from summary file summaries/sabdab_summary.tsv
2023-06-16 11:17:59::INFO::Extracting summary to json format
2023-06-16 11:18:00::INFO::Start downloading pdbs in the summary
2023-06-16 11:18:00::INFO::using local PDB files: /mnt/e/code/dyMEAN/all_structures/imgt
2023-06-16 11:18:00::INFO::Assume PDB file already renumbered with scheme imgt
2023-06-16 11:18:00::INFO::downloading raw files
  6%|████████▎ | 390/6741 [00:00<00:10, 613.14it/s]
6B3M not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
6TNP not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
6QXE not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
5FUU not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
2023-06-16 11:18:04::WARN::Trying for the 2 times
2023-06-16 11:18:05::WARN::Trying for the 3 times
fetched
6DZT not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
fetched
2023-06-16 11:18:06::WARN::Trying for the 4 times
fetched
  7%|█████████▊ | 457/6741 [00:06<02:46, 37.78it/s]
2023-06-16 11:18:07::WARN::Trying for the 5 times
2023-06-16 11:18:08::WARN::Get https://files.rcsb.org/download/5FUU.pdb failed
 15%|█████████████████████▍ | 1013/6741 [00:08<00:45, 125.59it/s]
fetched
Traceback (most recent call last):
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/e/code/dyMEAN/data/download.py", line 376, in <module>
    main(parse())
  File "/mnt/e/code/dyMEAN/data/download.py", line 369, in main
    items = download(items, out_path, args.n_cpu, args.pdb_dir, args.numbering, args.pre_numbered)
  File "/mnt/e/code/dyMEAN/data/download.py", line 280, in download
    valid_entries = thread_map(map_func, items, max_workers=ncpu)
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/mnt/e/code/dyMEAN/data/download.py", line 189, in download_one_item_local
    from_remote = fetch_from_pdb(pdb_id)
  File "/mnt/e/code/dyMEAN/data/download.py", line 35, in fetch_from_pdb
    data['pdb'] = text.text
AttributeError: 'NoneType' object has no attribute 'text'
Processing RAbD with output directory /mnt/e/code/dyMEAN/all_data/RAbD
2023-06-16 11:18:09::INFO::Namespace(fout='/mnt/e/code/dyMEAN/all_data/rabd_all.json', n_cpu=4, numbering='imgt', pdb_dir='/mnt/e/code/dyMEAN/all_structures/imgt', pre_numbered=True, summary='/mnt/e/code/dyMEAN/all_data/sabdab_all.json', type='rabd')
2023-06-16 11:18:09::INFO::download rabd from summary file /mnt/e/code/dyMEAN/all_data/sabdab_all.json
2023-06-16 11:18:09::INFO::Extracting summary to json format
Traceback (most recent call last):
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/e/code/dyMEAN/data/download.py", line 376, in <module>
    main(parse())
  File "/mnt/e/code/dyMEAN/data/download.py", line 360, in main
    items = read_rabd(fpath)
  File "/mnt/e/code/dyMEAN/data/download.py", line 94, in read_rabd
    with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/sabdab_all.json'
Traceback (most recent call last):
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/e/code/dyMEAN/data/split.py", line 249, in <module>
    main(parse())
  File "/mnt/e/code/dyMEAN/data/split.py", line 72, in main
    items = load_file(args.data)
  File "/mnt/e/code/dyMEAN/data/split.py", line 37, in load_file
    with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/sabdab_all.json'
2023-06-16 11:18:10::INFO::No meta-info file found, start processing
Traceback (most recent call last):
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/e/code/dyMEAN/data/dataset.py", line 323, in <module>
    dataset = E2EDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
  File "/mnt/e/code/dyMEAN/data/dataset.py", line 119, in __init__
    self.preprocess(file_path, save_dir, num_entry_per_file)
  File "/mnt/e/code/dyMEAN/data/dataset.py", line 192, in preprocess
    with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/RAbD/test.json'
2023-06-16 11:18:11::INFO::No meta-info file found, start processing
Traceback (most recent call last):
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/e/code/dyMEAN/data/dataset.py", line 323, in <module>
    dataset = E2EDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
  File "/mnt/e/code/dyMEAN/data/dataset.py", line 119, in __init__
    self.preprocess(file_path, save_dir, num_entry_per_file)
  File "/mnt/e/code/dyMEAN/data/dataset.py", line 192, in preprocess
    with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/RAbD/valid.json'
^CTraceback (most recent call last):
  File "/mnt/e/code/dyMEAN/data/dataset.py", line 104, in __init__
    with open(metainfo_file, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/RAbD/train_processed/_metainfo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/e/code/dyMEAN/data/dataset.py", line 323, in <module>
    dataset = E2EDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
  File "/mnt/e/code/dyMEAN/data/dataset.py", line 110, in __init__
    print_log('No meta-info file found, start processing', level='INFO')
  File "/mnt/e/code/dyMEAN/utils/logger.py", line 34, in print_log
    print(s, end=end)
KeyboardInterrupt
2023-06-16 11:18:12::INFO::No meta-info file found, start processing

kxz18 commented 1 year ago

It looks like this is because the PDB file for 5FUU is no longer available from the PDB database, which causes an error when fetching it over the network. I've added a branch to detect such errors. I've tested it, and it should be fine now.
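
For readers who hit the same AttributeError on an older checkout, the general pattern can be sketched as follows. This is an illustrative sketch using only the standard library, not the actual change made in data/download.py: the fetch helper returns None after a fixed number of failed attempts, and the caller checks for None and skips the entry instead of dereferencing it. The function names (`http_get`, `fetch_pdb_text`, `load_entries`) and their parameters are assumptions made for this example; only the download URL pattern is taken from the log above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Optional, Tuple


def http_get(url: str, timeout: float = 10.0) -> str:
    """Download a URL and return its body as text (raises on HTTP/network errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def fetch_pdb_text(
    pdb_id: str,
    get: Callable[[str], str] = http_get,
    max_tries: int = 5,
    wait_seconds: float = 1.0,
) -> Optional[str]:
    """Return the PDB file text, or None if every attempt fails."""
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    for attempt in range(1, max_tries + 1):
        try:
            return get(url)
        except (urllib.error.URLError, OSError) as err:
            print(f"attempt {attempt}/{max_tries} for {pdb_id} failed: {err}")
            if attempt < max_tries:
                time.sleep(wait_seconds)
    return None


def load_entries(
    pdb_ids: List[str],
    get: Callable[[str], str] = http_get,
    wait_seconds: float = 1.0,
) -> Tuple[List[dict], List[str]]:
    """Collect entries that could be fetched; skip the rest instead of crashing."""
    entries, skipped = [], []
    for pdb_id in pdb_ids:
        text = fetch_pdb_text(pdb_id, get=get, wait_seconds=wait_seconds)
        if text is None:
            # This is the branch that avoids calling attributes on None.
            skipped.append(pdb_id)
            continue
        entries.append({"pdb_id": pdb_id, "pdb": text})
    return entries, skipped
```

The `get` argument is injectable so the retry and skip logic can be exercised without network access. Collecting the skipped IDs also gives a list of entries to re-check by hand afterwards.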