Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

properly parse csv query results #124

Closed callum-oakley closed 1 year ago

callum-oakley commented 2 years ago

The problem described in #122 is not actually an issue with the string interpolation; it's an issue with parsing the results of a SPARQL query.

We ask for SPARQL query results in text/csv format, but parse them as simple line-separated values. This works fine until one of the values is quoted (because it contains a comma, newline, etc.). This PR pulls in clojure.data.csv to parse the response as a proper CSV.
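For illustration, here's a minimal REPL sketch (not the project's code; the body value is a made-up example) of the difference between the two approaches:

(require '[clojure.data.csv :as csv]
         '[clojure.string :as str])

;; a hypothetical CSV response body with a quoted value containing a comma
(def body "uri\n\"http://example.org/a,b\"\nhttp://example.org/c\n")

;; naive line splitting keeps the quotes and would break on embedded newlines
(str/split-lines body)
;; => ["uri" "\"http://example.org/a,b\"" "http://example.org/c"]

;; clojure.data.csv applies the CSV quoting rules properly
(csv/read-csv body)
;; => (["uri"] ["http://example.org/a,b"] ["http://example.org/c"])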

closes #122 closes #121

RickMoynihan commented 2 years ago

Hi I just skimmed over this PR, and I don't really know the details of the code, but the first thing that stood out is that we don't appear to be closing the CSV file.

It looks like @callum-oakley has noticed similar things because he's added a comment around deleting the file (which I also agree is a potential problem).

More generally, the issue is that we're mixing side effects with laziness, and that the lazy-seq is escaping.

I think fundamentally we should ask what is "pulling" the lazy sequence of CSV rows, as that is the context where the resource management needs to occur. I believe that is probably this function, which is executing an eager reduce over the file:

https://github.com/Swirrl/ook/blob/9acc7728318f446f556af43e9691545544eaa312/src/ook/etl.clj#L246-L260

I'd suggest considering putting a with-open in here, and doing the cleanup here too.

Also, you might consider using transducers for the partitioning/filtering/mapping ops, and then switching the reduce to a transduce taking the composed xform. To do that you may want to use cgrand's excellent xforms library to provide the missing transducer-based partition function, though maybe you can get away with clojure.core/partition-all instead?
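For what it's worth, a rough sketch of what the transduce version could look like, assuming clojure.core/partition-all suffices (page-size and handle-page here are hypothetical stand-ins, not the project's actual names):

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn count-uris [file page-size handle-page]
  (with-open [r (io/reader file)]
    (transduce
     (comp (map first)                 ;; first column of each CSV row
           (drop 1)                    ;; skip the header row
           (partition-all page-size))  ;; clojure.core's transducer arity
     (fn
       ([counter] counter)             ;; completion arity
       ([counter uris]
        (handle-page uris)             ;; process one page eagerly, inside with-open
        (+ counter (count uris))))
     0
     (csv/read-csv r))))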

Anyway, just some general advice; I appreciate it may not be core to this issue, so it could be explored in another PR or moved into an issue to tackle another day.

callum-oakley commented 2 years ago

@RickMoynihan the reader over the CSV file is closed on line 108.

It's a little buried... so I'm not surprised you missed it! I think the behaviour here is correct, if a little confusing. We wait until the CSV has been read to the end before closing the reader, so we only leak the reader if we fail to read the CSV all the way through for some reason.
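Roughly, the idiom under discussion looks like this (a hypothetical sketch, not the actual read-paged):

(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv])

(defn lazy-csv-rows [file]
  (let [r (io/reader file)]
    (letfn [(step [rows]
              (lazy-seq
               (if-let [s (seq rows)]
                 (cons (first s) (step (rest s)))
                 (do (.close r) nil))))] ;; closed only when fully consumed
      (step (csv/read-csv r)))))

If the consumer stops part-way (or an exception is thrown), the final (.close r) is never reached, which is the leak described above.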

That being said, I agree it would be nice to refactor this to keep everything eager and use with-open, but it's not just a case of pulling the with-open up to pipeline*, since pages can actually be drawing from many separate files (one per graph; the resulting lazy sequences are concatenated in select-paged). I think one approach that has legs would be to avoid lazy sequences entirely, and instead pass a page-handling function down inside the read loop and process each page eagerly (or use transducers, as you say).

The TODO I left about deleting the temporary file before we've finished reading it is a related but separate issue, and I think it genuinely could be a problem... but on the other hand it seems to work fine at the moment 🤷 (I suppose we might have already read the whole file into memory at this point?)

RickMoynihan commented 2 years ago

Cool. I agree with all of that @callum-oakley; I hadn't read the code in enough depth to see the correct place to close the file.

Sticking a .close at the end of a seq works in the happy path; but as you say not in the case of an exception, either in the I/O or in the other processing in the lazy-seq. Your suggestions all seem very sensible.

🙇

Robsteranium commented 2 years ago

Thanks Callum. This is great and makes much more sense than my supposition!

In terms of Rick's comments...

We already spill to disk within a with-open.

read-paged calls .close explicitly itself - IIRC I had to do this so we could read lines out lazily and retain control over closing the reader at the end.

We could wrap this in a try with another .close in a finally, but I'm not sure it's necessary. Won't the .delete, which is in a finally, automatically close the reader?

In terms of Callum's TODO comment...

Won't that .delete only get called once the lazy-seq in read-paged has been consumed in the pipeline's doseq?

Robsteranium commented 2 years ago

I think we can merge this. It might be nice to first figure out exactly what we want to do about the other stuff discussed here, then file issues instead of the TODO comment.

RickMoynihan commented 2 years ago

Firstly, yes, I agree none of the above needs to be a blocker on merging the PR.

Feel free to make an issue to address the other issues...

To answer your points though:

read-paged calls .close explicitly itself - IIRC I had to do this so we could read lines out lazily and retain control over closing the reader at the end.

Yes, Callum drew my attention to this, and it means you "should" be OK in the happy path, except for @callum-oakley's concern in the TODO comment, which is that the deletion could happen before the sequence has been consumed, i.e. before the CSV has been fully read.

We could wrap this in a try with another .close in a finally, but I'm not sure it's necessary. Won't the .delete, which is in a finally, automatically close the reader?

In short, I don't think so... and even if it were the case, it's arguably an obscure enough detail to warrant explicit attention. My belief is that deleting the file is NOT really the same as cleaning up the resources associated with the file handle in the JVM/OS.
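For example, a minimal REPL sketch (assuming POSIX unlink semantics; on Windows the delete may simply fail while the file is open) suggests the two are independent:

(require '[clojure.java.io :as io])

(let [f (java.io.File/createTempFile "ook-demo" ".csv")]
  (spit f "a,b\n1,2\n")
  (let [r (io/reader f)]
    (.delete f)              ;; unlinks the file; the open descriptor remains
    (println (.readLine r))  ;; still prints "a,b" on POSIX systems
    (.close r)))             ;; the reader still has to be closed explicitly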

Regardless, be careful, because we're asking/answering the wrong question here; the TODO comment is asking the right one, and what it describes is IMHO also unsafe.

i.e. the real problem is that the finally will, I think, execute before the lazy sequence is fully evaluated. Why you're not seeing an error in practice I don't know, but it's most likely because the sequence has been partially (or maybe even fully) evaluated before then. Lazy sequences can be very subtle, and I suspect it is only the subtleties of the functions you are applying here that are saving you from the error. For example, there isn't really one "lazy sequence" implementation; more correctly, there are sequence functions that return partially realised or unrealised sequences, e.g. map over a chunked seq realises batches of 32 at a time, but IIRC lazy-seq will do 1 at a time, and other idioms such as using apply may mean different quantities of realisation.
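To illustrate the last point, a small REPL sketch (not project code) showing how the amount realised depends on how the sequence was built:

(def realised (atom 0))

;; mapping over a chunked seq (like range) realises a whole 32-element chunk
(first (map (fn [x] (swap! realised inc) x) (range 100)))
@realised ;; => 32

;; a hand-rolled lazy-seq realises one element at a time
(defn one-at-a-time [xs]
  (lazy-seq
   (when-let [s (seq xs)]
     (swap! realised inc)
     (cons (first s) (one-at-a-time (rest s))))))

(reset! realised 0)
(first (one-at-a-time (range 100)))
@realised ;; => 1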

Robsteranium commented 2 years ago

Was that issue introduced by this PR (with the switch to map/csv/read-csv) or did it exist before (with .readLine)?

I'd be happy to raise an issue, but I'm not sure what it should say! Could you confirm/correct the following interpretation please?

Presumably, if this happens, we will see an exception raised about the file being closed prematurely? If so, I think it's enough to wait for this to happen before fiddling with it.

Callum has suggested we read eagerly. Perhaps I've misunderstood but I'm not sure this will work for very large graphs (we're talking GB of URIs).

You've both suggested transducers. I guess the xform would receive/create something non-lazy, but without doing some research I'm having trouble imagining exactly what that's supposed to look like and how it will avoid loading everything into memory at once.

I guess another solution might be to delete the file after closing the reader (maybe giving read-paged a callback). Ideally we'd also wrap this with a finally to ensure we don't leave (large) cache files lying around after exceptions, but I think that just puts us back to square one!

callum-oakley commented 2 years ago

Was that issue introduced by this PR (with the switch to map/csv/read-csv) or did it exist before (with .readLine)?

Yes, this existed before. The root of the problem is that read-paged (and by extension select-paged) returns a lazy sequence that reads from the file, but the finally in select-paged will run before select-paged has returned, so in principle at least the file is deleted before it is read (in the reduce)! In practice this clearly isn't happening (perhaps the delete is queued and we manage to read the file to completion before it actually happens), but I wouldn't be confident that it will always work.
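A minimal sketch (hypothetical, not select-paged itself) of that ordering:

(defn leaky-lines []
  (try
    (map #(do (println "reading" %) %) (range 3))
    (finally
      (println "finally: cleaning up"))))

(def xs (leaky-lines))
;; prints "finally: cleaning up" immediately; the seq is still unrealised
(doall xs)
;; only now prints "reading 0", "reading 1", "reading 2", after the "cleanup"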

Your bullet points all look correct to me, apart from the last one. reduce is fully eager, but it doesn't matter how eager it is, since the finally in select-paged has already run before the reduce starts.

Callum has suggested we read eagerly. Perhaps I've misunderstood but I'm not sure this will work for very large graphs (we're talking GB of URIs).

Instead of reading results into a lazy sequence and passing that sequence to a function which processes an element at a time, you can move that processing function inside the read loop. So you're reading eagerly, but processing each element as you read it, so it doesn't have to remain in memory. Something like:

;; note: `etc` stands in for the other arguments the body needs
;; (index, construct-query, jsonld-frame and page-size below)
(defn process-file [file system etc]
  (with-open [r (io/reader file)]
    (let [data (map first (csv/read-csv r)) ;; this is still lazy, but we read it to completion before returning
          var-name (first data)]            ;; the header cell, i.e. the SPARQL variable name
      (reduce
       (fn [counter uris]
         (log/info "Processing page starting with" index "subject" counter)
         (if uris
           (with-logged-retry (insert-values-clause construct-query var-name uris)
             (->> (extract system construct-query var-name uris)
                  (transform jsonld-frame)
                  (load-documents system index)))
           (log/warn (str "No compatible (" index ") subjects found!")))
         (+ counter (count uris)))
       0
       ;; pages of URIs, skipping the header row
       (partition-all page-size (rest data))))))

The function inside the reduce is the same, but this can be called for each file, rather than reading all the files into lazy sequences "in advance" (but not really, because laziness) and then processing them in a single big reduction.

Robsteranium commented 1 year ago

I've created #126 for moving processing into the read loop, as Callum has suggested.