RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

`Graph.parse` fails to parse TemporaryFile #2357

Open ripry opened 1 year ago

ripry commented 1 year ago

Hi there! I'm using RDFLib with FastAPI. The following code passes the file received through FastAPI's UploadFile to RDFLib:

from fastapi import APIRouter, UploadFile
from rdflib import Graph

router = APIRouter()

@router.post('/parse')
async def parse(file: UploadFile):
    graph = Graph().parse(file=file.file)

Then I got the following error:

File "/usr/local/lib/python3.11/site-packages/rdflib/graph.py", line 1470, in parse
  source = create_input_source(
           ^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/rdflib/parser.py", line 428, in create_input_source
  input_source = FileInputSource(file)
                 ^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/rdflib/parser.py", line 316, in __init__
  system_id = URIRef(pathlib.Path(file.name).absolute().as_uri(), base=base)  # type: ignore[union-attr]
                     ^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/pathlib.py", line 871, in __new__
  self = cls._from_parts(args)
         ^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/pathlib.py", line 509, in _from_parts
  drv, root, parts = self._parse_args(args)
                     ^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/pathlib.py", line 493, in _parse_args
  a = os.fspath(a)
      ^^^^^^^^^^^^

I think this is because FastAPI's UploadFile wraps a SpooledTemporaryFile, whose name can be an int (ref: https://github.com/python/cpython/issues/62095). So the line that builds the file path, system_id = URIRef(pathlib.Path(file.name).absolute().as_uri(), base=base), needs to change, but I can't think of a good way to do it...
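A minimal standard-library sketch of the suspected cause, with no RDFLib or FastAPI involved. It assumes a POSIX system, where the name becomes an integer file descriptor once the spooled file rolls over to disk; on other platforms it may be a string. The helper describe_name and the tiny max_size are made up for illustration.

```python
import pathlib
import tempfile


def describe_name(stream):
    """Return the stream's `name` attribute and the name of its type."""
    name = getattr(stream, "name", None)
    return name, type(name).__name__


with tempfile.SpooledTemporaryFile(max_size=8) as spooled:
    # Still below max_size: the data lives in memory and there is no name.
    spooled.write(b"tiny")
    name_in_memory, _ = describe_name(spooled)

    # Exceeding max_size rolls the data over to a real temporary file.
    spooled.write(b" but now larger than max_size")
    name_on_disk, kind_on_disk = describe_name(spooled)

# pathlib.Path rejects an int, which matches the os.fspath frame
# at the bottom of the traceback above. The 7 is an arbitrary stand-in.
try:
    pathlib.Path(7)
    path_error = None
except TypeError as exc:
    path_error = exc

print("in memory:", name_in_memory)
print("on disk:", name_on_disk, f"({kind_on_disk})")
print("Path(int) ->", type(path_error).__name__)
```

If this matches what happens inside UploadFile, it would also explain why small uploads behave differently from large ones.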

ripry commented 1 year ago

Here is a workaround:

from tempfile import NamedTemporaryFile

from fastapi import APIRouter, UploadFile
from rdflib import Graph

router = APIRouter()

@router.post('/parse')
async def parse(file: UploadFile):
    with NamedTemporaryFile() as tmp:
        with open(tmp.name, "w+b") as writer:
            for chunk in file.file:
                writer.write(chunk)
        with open(tmp.name, "r+b") as reader:
            graph = Graph().parse(file=reader)
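
The same idea can be sketched with the standard library only, without reopening the file by path. This is a hedged variant, not a tested fix: copy_to_named_file is a hypothetical helper, and it assumes the parser only needs a file object whose name is a string.

```python
import shutil
import tempfile


def copy_to_named_file(stream, suffix=""):
    """Copy a nameless (or int-named) binary stream into a NamedTemporaryFile.

    The returned file has a string `name` and is positioned at the start.
    The caller is responsible for closing it, e.g. via a `with` block.
    """
    named = tempfile.NamedTemporaryFile(suffix=suffix)
    stream.seek(0)
    shutil.copyfileobj(stream, named)
    named.seek(0)
    return named


# Stand-in for an uploaded file; the small max_size forces a rollover to disk.
with tempfile.SpooledTemporaryFile(max_size=4) as upload:
    upload.write(b"<a> <b> <c> .\n")
    with copy_to_named_file(upload) as named:
        copied_name = named.name
        copied_bytes = named.read()

print(type(copied_name).__name__, copied_bytes)
```

In the route above, the object returned by the helper would be passed as Graph().parse(file=named) in place of reader.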

aucampia commented 1 year ago

Thanks for reporting this @ripry, it does seem like something is wrong here.

Could you try this though:

from fastapi import APIRouter, UploadFile
from rdflib import Graph

router = APIRouter()

@router.post('/parse')
async def parse(file: UploadFile):
    graph = Graph().parse(source=file.file)

I think it should work, but I'm not sure. Either way, I think this is a bug.

ripry commented 1 year ago

@aucampia Thanks for the reply! I have already tried Graph().parse(source=file.file). The following error occurs:

  File "path/to/source/app.py", line 6, in parse
    graph = Graph().parse(source=file.file)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/rdflib/graph.py", line 1494, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python3.11/site-packages/rdflib/plugins/parsers/notation3.py", line 2015, in parse
    baseURI = graph.absolutize(source.getPublicId() or source.getSystemId() or "")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/rdflib/graph.py", line 1225, in absolutize
    return self.namespace_manager.absolutize(uri, defrag)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/rdflib/namespace/__init__.py", line 722, in absolutize
    result = urljoin("%s/" % base, uri, allow_fragments=not defrag)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/parse.py", line 521, in urljoin
    base, url, _coerce_result = _coerce_args(base, url)
                                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/parse.py", line 121, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments

This error occurs only when the uploaded file is larger than the max_size of the SpooledTemporaryFile. (In FastAPI, max_size defaults to 1 MB.)
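
The last frames of the traceback can be reproduced with urllib alone. This sketch assumes the value reaching urljoin is the file's integer name, which I have not confirmed inside RDFLib; try_join, the base URL, and the 7 are stand-ins for illustration.

```python
from urllib.parse import urljoin


def try_join(base, uri):
    """Join `uri` onto `base`, returning the error text instead of raising."""
    try:
        return urljoin(base, uri)
    except TypeError as exc:
        return f"TypeError: {exc}"


base = "file:///tmp/"

# Small upload: the name is None, so `... or ""` yields an empty string.
small_result = try_join(base, None or "")

# Large upload: the name is an int (7 stands in for a file descriptor).
large_result = try_join(base, 7)

print(small_result)
print(large_result)
```

That would be consistent with small files parsing fine and large ones failing.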

If Graph().parse(source=file.file) is the correct usage, then this probably needs a separate issue...