dathere / datapusher-plus

A standalone web service that pushes data into the CKAN Datastore fast & reliably. It pushes real good!
GNU Affero General Public License v3.0
27 stars 21 forks source link

URL parameters can break temp files #77

Closed GordianDziwis closed 1 year ago

GordianDziwis commented 1 year ago

Describe the bug

2023-04-13 15:51:05,819 INFO Fetching from: https://statistik.leipzig.de/opendata/api/values?kategorie_nr=5&rubrik_nr=2&periode=y&format=csv...
[pid: 6|app: 0|req: 2/2] 127.0.0.1 () {32 vars in 501 bytes} [Thu Apr 13 15:51:05 2023] GET /job/54ebecf7-2a52-4347-a241-e6f651b7f9e0 => generated 930 bytes in 2 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 1)
2023-04-13 15:51:06,124 ERROR Job "push_to_datastore (trigger: date[2023-04-13 15:51:05 UTC], next run at: 2023-04-13 15:51:05 UTC)" raised an exception
Traceback (most recent call last):
  File "/usr/lib/datapusher-plus/lib/python3.10/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/lib/datapusher-plus/src/datapusher-plus/datapusher/jobs.py", line 471, in push_to_datastore
    tmp = tempfile.NamedTemporaryFile(suffix="." + resource.get("format").lower())
  File "/usr/lib/python3.10/tempfile.py", line 698, in NamedTemporaryFile
    file = _io.open(dir, mode, buffering=buffering,
  File "/usr/lib/python3.10/tempfile.py", line 695, in opener
    fd, name = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/lib/python3.10/tempfile.py", line 395, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmppdo2j9wn.text/csv'

To Reproduce Steps to reproduce the behavior:

  1. Create a resource with this link: https://statistik.leipzig.de/opendata/api/values?kategorie_nr=5&rubrik_nr=2&periode=y&format=csv
  2. Upload resource with datapusher-plus
jqnatividad commented 1 year ago

Thanks for the report @BonaBeavis .

Can confirm this is a bug.

GordianDziwis commented 1 year ago

Where are the path for creating and reading the temp files created? I can look into this.

jqnatividad commented 1 year ago

Looking at the error, it seems that we need to add some logic to sanitize the file extension.

The resource format attribute can sometimes be the full MIME type which contains a slash, and not just the extension.

@BonaBeavis , can you confirm and fix this?

GordianDziwis commented 1 year ago

Sure can you point me to the line, where the path is built?

jqnatividad commented 1 year ago

Great!

The whole download section is here - check line 471 where we just grab resource.format.

Perhaps we should check that the resulting temp file is valid, and handle the case when the full MIME type is stored instead of just the file extension (e.g. application/csv and not just csv)

https://github.com/dathere/datapusher-plus/blob/3a3c69687f16973cbdfa9f52a4d6db11ac48f908/datapusher/jobs.py#L429-L555

GordianDziwis commented 1 year ago

Thanks!!