Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0
565 stars, 55 forks

load-lazy returns one character at a time #41

Closed jeremyrsellars closed 10 years ago

jeremyrsellars commented 10 years ago

load-lazy returns one character at a time, instead of all the characters between delimiters. The existing implementation, when loading

abcd\t1234

would return [("a" "b" "c" "d" "1" "2" "3" "4")], but should return [("abcd" "1234")].

mbossenbroek commented 10 years ago

Yeah, I never quite got that one working (hence the disclaimer).

The problem I had with just adding a + was this:

=> (clojure.core/re-seq #"[^\t]+" "abc\t123")
("abc" "123")
=> (clojure.core/re-seq #"[^\t]+" "\tabc\t\t123\t")
("abc" "123")

It drops empty columns, which isn't good if your data is structured. I played around with it a bit this morning & came up with this regex:

=> (map second (clojure.core/re-seq #"(?<=\t|^)([^\t]*)(\t|$)" "\tabc\t\t123\t"))
("" "abc" "" "123" "")

I'm hoping that the lookbehind doesn't cause perf or memory problems. Could you give it a try on some real data & see if you have any problems?

Thanks for the pull request!

mbossenbroek commented 10 years ago

FWIW, load-lazy should only be used if you have really wide rows with thousands of columns. For anything smaller, load-tsv should be sufficient.

I originally made this in an attempt to process really wide rows, but there's still a problem in that Pig can't incrementally emit values. This means that even if you follow load-lazy with a mapcat to pivot the wide row into smaller ones, all of the values for each row still have to fit into memory.

Said another way, Pig doesn't support returning a lazy seq. You have to materialize the whole list & return it at once. It's one of the disappointing limitations of Pig.

If you can filter the row, or reduce over it while you read it, then you shouldn't have this problem. Either way, I thought I would point out what I found. In the end I ended up using another machine with more memory to preprocess the wide rows & make them more hadoop friendly. The nice part was the only code change I had to make was to remove the pig/ namespace :)
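For concreteness, the pivot pattern described above might look roughly like this (a sketch only; pivot-wide-rows is a hypothetical helper, and it assumes the pigpen.core operators named here):

```clojure
(require '[pigpen.core :as pig])

;; Hypothetical sketch: pivot one very wide row into many narrow
;; [index value] rows. Note the caveat above: Pig materializes the
;; entire seq returned by the mapcat fn, so all of the values for a
;; single row still have to fit in memory.
(defn pivot-wide-rows [location]
  (->> (pig/load-lazy location)
       (pig/mapcat (fn [fields]
                     (map-indexed (fn [i v] [i v]) fields)))))
```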

jeremyrsellars commented 10 years ago

@mbossenbroek, thanks for the quick thinking on that regex. This regex #"\G([^\t]*)(?:[\t]|$)" works just like yours, except:

  1. The \G anchors to the start of string or the end of the previous match, so it doesn't need to look behind.
  2. The (?:[\t]|$) doesn't capture the delimiter, saving a bit of work and string creation. And if someone is trying to split on any of [,:;], then that behavior is honored.

I updated the pull request so it works like this when the delimiter is \t: (map second (re-seq #"\G([^\t]*)(?:[\t]|$)" s))

It seems like the best option is to implement a lazy version of clojure.string/split, which would remove the potentially unexpected split-on-any-char behavior by using (.indexOf s delimiter).

FWIW, my particular use case for load-lazy was parsing test data where all the records were on one line, so efficiency wasn't a concern.

mbossenbroek commented 10 years ago

The \G still seems to return an extra empty string on the end:

=> (map second (re-seq #"\G([^\t]*)(?:[\t]|$)" "\tabc\t\t123\tfoo"))
("" "abc" "" "123" "foo" "")

Did I key it in wrong? I've never used \G before (or lookbehind), but that's what I was trying to avoid.

I had planned to write something more efficient & lazy, but never got around to it. The clojure.string/split call in load-tsv could use some speedup too. It takes a regex, but delimiters rarely actually require one.

jeremyrsellars commented 10 years ago

Oops. You are right. I don't see a way of using the \G. It sounds like the best approach is the lazy split.

mbossenbroek commented 10 years ago

Another option that just occurred to me: load-string. This didn't exist when I wrote load-lazy, but that'll just give you the raw string for each line. You could then use any variety of splitting code in a subsequent pig/map operator.
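A rough sketch of that approach (assuming pig/load-string and pig/map behave as described; load-and-split is a hypothetical name):

```clojure
(require '[pigpen.core :as pig]
         '[clojure.string :as string])

;; Sketch: load each line as a raw string, then split it in a
;; subsequent map operator. The -1 limit tells Pattern.split to
;; keep trailing empty strings, so empty columns survive.
(defn load-and-split [location]
  (->> (pig/load-string location)
       (pig/map (fn [line] (string/split line #"\t" -1)))))
```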

I'd still like to fix load-lazy, but just another idea...

jeremyrsellars commented 10 years ago

I wrote a lazy-split, but am having a hard time getting it integrated into the tests.

It splits based on a string, not a regex. Is it also necessary to support lazy splitting on a regex?

(defn lazy-split
  "Returns a lazy sequence of the string split on the delimiter."
  [s delimiter]
  (let [^String s (str s)
        ;; default to tab when no delimiter is provided
        delimiter (str (if (empty? delimiter) "\t" delimiter))
        step (fn step [index]
               ;; <= rather than < so a trailing delimiter yields a final empty field
               (when (<= index (.length s))
                 (let [index-of-delimiter (.indexOf s delimiter index)
                       end (if (<= 0 index-of-delimiter) index-of-delimiter (.length s))
                       next-step (+ end (.length delimiter))]
                   (cons (subs s index end) (lazy-seq (step next-step))))))]
    (step 0)))
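For reference, it matches the lookbehind regex's output on the earlier examples, and also handles multi-character delimiters:

```clojure
=> (lazy-split "abc\t123" "\t")
("abc" "123")
=> (lazy-split "\tabc\t\t123\t" "\t")
("" "abc" "" "123" "")
=> (lazy-split "a::b::c" "::")
("a" "b" "c")
```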

mbossenbroek commented 10 years ago

I like it - nice job!

Could you put it in pigpen.extensions.core? I think it fits best there.

Just a couple of small polish suggestions - I'd prefer to type hint/assert the parameters instead of coercing them to strings. As it is here, the user could pass in a regex delimiter and it would be treated as a string. I'd prefer to fail if they don't provide the right input. Then you also wouldn't need the [^String s (str s)] bit.

What problems are you having with the tests?

If you get a chance, could you run this through Criterium? I'd like to see how it compares to the regex approach or clojure.string/split. It might be worthwhile to use this in load-tsv and get rid of load-lazy altogether. We could potentially use this if the user passes a string delimiter & fall back to one of the others for a regex. The only pitfall I could see is that load-tsv would now return lists instead of vectors - I'm wondering if anyone is depending on that distinction...
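A Criterium comparison might look roughly like this (a sketch; row is a synthetic wide line, and lazy-split is the function from the previous comment):

```clojure
(require '[criterium.core :refer [quick-bench]]
         '[clojure.string :as string])

;; Synthetic wide row: 1000 tab-separated columns.
(def row (string/join "\t" (repeat 1000 "value")))

;; doall forces the lazy seqs so we time the full split, not just
;; building the first cons cell.
(quick-bench (doall (string/split row #"\t")))
(quick-bench (doall (map second (re-seq #"(?<=\t|^)([^\t]*)(\t|$)" row))))
(quick-bench (doall (lazy-split row "\t")))
```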

Whatever the result - I'm fine with patching load-lazy to only support string delimiters. It doesn't work now & it has a disclaimer in the docstring that it might break :)

mbossenbroek commented 10 years ago

My sincere apologies for the delayed response! Github decided not to email me your last response to this thread. I was out on vacation this last month & wasn't checking the site directly.

I'll get this merged & released today. Thanks for your work on it. Again, sorry for the delay.