parse cif to train structured data error

FlorientHuang commented 1 month ago

/HOME/scw6dlr/huangsuyuan/ProteinMPNN-main/training/parse_cif_noX.py:233: SyntaxWarning: invalid escape sequence '\('
  for expression in re.split('\(|\)', oper_expression) if expression]
Traceback (most recent call last):
  File "/HOME/scw6dlr/.conda/envs/mlfold/lib/python3.12/site-packages/pdbx/reader/PdbxReader.py", line 360, in __tokenizer
    line = next(fileIter)
           ^^^^^^^^^^^^^^
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/HOME/scw6dlr/huangsuyuan/ProteinMPNN-main/training/parse_cif_noX.py", line 457, in <module>
    chains,metadata = parse_mmcif(IN)
                      ^^^^^^^^^^^^^^^
  File "/HOME/scw6dlr/huangsuyuan/ProteinMPNN-main/training/parse_cif_noX.py", line 274, in parse_mmcif
    reader.read(data)
  File "/HOME/scw6dlr/.conda/envs/mlfold/lib/python3.12/site-packages/pdbx/reader/PdbxReader.py", line 72, in read
    self.__parser(self.__tokenizer(self.__ifh), containerList)
  File "/HOME/scw6dlr/.conda/envs/mlfold/lib/python3.12/site-packages/pdbx/reader/PdbxReader.py", line 275, in __parser
    curCatName, curAttName, curQuotedString, curWord = next(tokenizer)
                                                       ^^^^^^^^^^^^^^^
RuntimeError: generator raised StopIteration

Is this a package mismatch?
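
A side note on the SyntaxWarning at the top of the log: it looks unrelated to the crash. Python 3.12 reports invalid escape sequences such as '\(' in ordinary string literals as a SyntaxWarning, and the usual remedy is a raw string. Below is a minimal sketch of that change. The sample value of oper_expression is made up for illustration, and the list comprehension only imitates the shape of line 233 rather than reproducing the script.

```python
import re

# Hypothetical operator expression, chosen only to exercise the pattern.
oper_expression = "(1-3)(4)"

# A raw string keeps the backslashes for the regex engine, so Python no
# longer sees '\(' as an (invalid) string escape sequence.
pattern = r'\(|\)'

# Split on either parenthesis and drop the empty fragments.
parts = [expression
         for expression in re.split(pattern, oper_expression) if expression]

print(parts)
```

The pattern is the same one the original line passes to re.split; only the string prefix changes, so the parsing behavior should be unaffected.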

FlorianWieser1 commented 2 weeks ago

Dear FlorientHuang, I have the same issue. I installed https://pypi.org/project/pdbx-mmcif/ and I'm passing a gzipped mmCIF file from the Protein Data Bank to the script: python parse_cif_noX.py 1UBQ.cif.gz out. Were you able to fix the error? What am I missing? Thanks a lot in advance!

FlorianWieser1 commented 2 weeks ago

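The issue is likely that our Python version is too recent for this pdbx release. Since Python 3.7 (PEP 479), a StopIteration that escapes a generator body is converted into a RuntimeError, which matches the "generator raised StopIteration" message in the traceback. I was able to fix it on my side by catching the StopIteration in the __tokenizer() method of mlfold/lib/python3.10/site-packages/pdbx/reader/PdbxReader.py: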

        # Tokenizer loop begins here ---
        while True:
            try:
                line = next(fileIter)
            except StopIteration:
                return
            self.__curLineNumber += 1
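
To see why this patch helps, here is a small self-contained sketch of the same pattern outside of pdbx. The two generator functions and the sample lines are hypothetical stand-ins, not the actual PdbxReader code. They only assume that the tokenizer is a generator that pulls lines with next(), as the traceback suggests.

```python
def tokenizer_unpatched(lines):
    """Generator that lets StopIteration escape, as in the traceback."""
    file_iter = iter(lines)
    while True:
        # When the input is exhausted, next() raises StopIteration inside
        # the generator body; Python 3.7+ turns that into RuntimeError.
        line = next(file_iter)
        yield line.strip()


def tokenizer_patched(lines):
    """Same loop, but end of input finishes the generator cleanly."""
    file_iter = iter(lines)
    while True:
        try:
            line = next(file_iter)
        except StopIteration:
            return  # a plain return is the supported way to end a generator
        yield line.strip()


# Hypothetical stand-in for the lines of an mmCIF file.
lines = ["data_1UBQ\n", "_entry.id 1UBQ\n"]

try:
    list(tokenizer_unpatched(lines))
    unpatched_error = None
except RuntimeError as exc:
    unpatched_error = str(exc)

patched_tokens = list(tokenizer_patched(lines))

print(unpatched_error)
print(patched_tokens)
```

On Python 3.7 or newer, the unpatched version should fail with the same message as in the report, while the patched version simply stops at the end of the input. Whether the rest of PdbxReader then handles the end of the token stream correctly is not shown here. The report above only confirms that the patch worked in that one environment.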

dauparas / ProteinMPNN

parse cif to train structured data error #114