Closed: jeremyrsellars closed this issue 10 years ago.
Yeah, I never quite got that one working (hence the disclaimer). The problem I had with just adding a `+` was this:

```clojure
=> (clojure.core/re-seq #"[^\t]+" "abc\t123")
("abc" "123")
=> (clojure.core/re-seq #"[^\t]+" "\tabc\t\t123\t")
("abc" "123")
```
It drops empty columns, which isn't good if your data is structured. I played around with it a bit this morning & came up with this regex:
```clojure
=> (map second (clojure.core/re-seq #"(?<=\t|^)([^\t]*)(\t|$)" "\tabc\t\t123\t"))
("" "abc" "" "123" "")
```
I'm hoping that the lookbehind doesn't cause perf or memory problems. Could you give it a try on some real data & see if you have any problems?
Thanks for the pull request!
FWIW, `load-lazy` should only be used if you have really wide rows with thousands of columns. For anything smaller, `load-tsv` should be sufficient.
I originally made this in an attempt to process really wide rows, but there's still a problem in that Pig can't incrementally emit values. This means that even if you follow `load-lazy` with a `mapcat` to pivot the wide row into smaller ones, all of the values for each row still have to fit into memory.
Said another way, Pig doesn't support returning a lazy seq. You have to materialize the whole list & return it at once. It's one of the disappointing limitations of Pig.
If you can filter the row, or reduce over it while you read it, then you shouldn't have this problem. Either way, I thought I would point out what I found. In the end I ended up using another machine with more memory to preprocess the wide rows & make them more hadoop friendly. The nice part was the only code change I had to make was to remove the `pig/` namespace :)
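To illustrate the difference in plain Clojure (no Pig involved; the million-column "row" here is just a hypothetical stand-in): reducing over a lazy seq of columns consumes one element at a time, while returning the seq from a Pig operator forces the whole collection to be realized first.

```clojure
;; A hypothetical wide "row" as a lazy seq of column strings.
(defn wide-row []
  (map str (range 1000000)))

;; Reducing while reading: only one column is in play at a time.
(def total-chars (reduce + (map count (wide-row))))

;; Materializing (what Pig forces when you return the whole seq):
;; the entire realized collection must fit in memory at once.
(def column-count (count (vec (wide-row))))
```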
@mbossenbroek, Thanks for the quick thinking on that regex. This regex `#"\G([^\t]*)(?:[\t]|$)"` works just like yours, except:

- `\G` anchors to the start of the string or the end of the previous match, so it doesn't need to look behind.
- `(?:[\t]|$)` doesn't capture the delimiter, saving a bit of work and string creation. And if someone is trying to split on any of `[,:;]`, then that behavior is honored.

I updated the pull request so it works like this when the delimiter is `\t`:

```clojure
(map second (re-seq #"\G([^\t]+)(?:[\t]|$)" s))
```
It seems like the best option is to implement a lazy version of `clojure.string/split`, which would remove the potentially unexpected split-on-any-char behavior by using `(.indexOf s delimiter)`.
FWIW, my particular use case for `load-lazy` was parsing test data where all the records were on one line, so efficiency wasn't a concern.
The `\G` still seems to return an extra empty string on the end:

```clojure
=> (map second (re-seq #"\G([^\t]*)(?:[\t]|$)" "\tabc\t\t123\tfoo"))
("" "abc" "" "123" "foo" "")
```

Did I key it in wrong? I've never used `\G` before (or lookbehind), but that's what I was trying to avoid.
I had planned to write something more efficient & lazy, but never got around to it. The use of `clojure.string/split` in `load-tsv` could use some speedup too. It takes a regex, but delimiters rarely actually need to be regexes.
Oops. You are right. I don't see a way of using `\G`. It sounds like the best approach is the lazy split.
Another option that just occurred to me: `load-string`. This didn't exist when I wrote `load-lazy`, but that'll just give you the raw string for each line. You could then use any variety of splitting code in a subsequent `pig/map` operator.

I'd still like to fix `load-lazy`, but just another idea...
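A sketch of what that subsequent map step might do, in plain Clojure (pigpen specifics elided; `split-line` is just an illustrative name):

```clojure
(require '[clojure.string :as str])

;; With load-string you get the raw line; split it yourself.
;; The -1 limit keeps trailing empty columns, which the default drops.
(defn split-line [line]
  (str/split line #"\t" -1))

(split-line "\tabc\t\t123\t")
;; => ["" "abc" "" "123" ""]
```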
I wrote a lazy-split, but am having a hard time getting it integrated into the tests.
It splits based on a string, not a regex. Is it also necessary to support lazy splitting on a regex?
```clojure
(defn lazy-split
  "Returns a lazy sequence of the string split on the delimiter."
  [s delimiter]
  (let [^String s (str s)
        delimiter (str (if (empty? delimiter) "\t" delimiter))
        step (fn step [index]
               (when (<= index (.length s))
                 (let [index-of-delimiter (.indexOf s delimiter index)
                       end (if (<= 0 index-of-delimiter) index-of-delimiter (.length s))
                       next-step (+ end (.length delimiter))]
                   (cons (subs s index end) (lazy-seq (step next-step))))))]
    (step 0)))
```
I like it - nice job!

Could you put it in pigpen.extensions.core? I think it fits best there.

Just a couple of small polish suggestions - I'd prefer to type hint/assert the parameters instead of coercing them to strings. As it is here, the user could pass in a regex delimiter and it would be treated as a string. I'd prefer to fail if they don't provide the right input. Then you also wouldn't need the `[^String s (str s)]` bit.
What problems are you having with the tests?
If you get a chance, could you run this through Criterium? I'd like to see how it compares to the regex approach or `clojure.string/split`. It might be worthwhile to use this in `load-tsv` and get rid of `load-lazy` altogether. We could potentially use this if the user passes a string delimiter & fall back to one of the others for a regex. The only pitfall I could see is that `load-tsv` would now return lists instead of vectors - I'm wondering if anyone is depending on that distinction...

Whatever the result - I'm fine with patching `load-lazy` to only support string delimiters. It doesn't work now & it has a disclaimer in the docstring that it might break :)
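For what it's worth, a quick correctness check before any Criterium timing (the `lazy-split` definition is repeated here, in slightly compressed form, so the snippet stands alone): both approaches keep empty columns, and the list-vs-vector difference mentioned above shows up in the results.

```clojure
(require '[clojure.string :as str])

(defn lazy-split
  "Returns a lazy sequence of the string split on the delimiter."
  [s delimiter]
  (let [^String s (str s)
        delimiter (str (if (empty? delimiter) "\t" delimiter))
        step (fn step [index]
               (when (<= index (.length s))
                 (let [i (.indexOf s delimiter index)
                       end (if (<= 0 i) i (.length s))]
                   (cons (subs s index end)
                         (lazy-seq (step (+ end (.length delimiter))))))))]
    (step 0)))

;; Both keep the empty columns, including the trailing one:
(lazy-split "\tabc\t\t123\t" "\t")     ; => ("" "abc" "" "123" "") -- a lazy seq
(str/split "\tabc\t\t123\t" #"\t" -1)  ; => ["" "abc" "" "123" ""] -- a vector
```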
My sincere apologies for the delayed response! Github decided not to email me your last response to this thread. I was out on vacation this last month & wasn't checking the site directly.
I'll get this merged & released today. Thanks for your work on it. Again, sorry for the delay.
`load-lazy` returns all characters between delimiters, instead of one at a time. The existing implementation, when loading a line like `abcd\t1234`, would return `[("a" "b" "c" "d" "1" "2" "3" "4")]`, but should return `[("abcd" "1234")]`.