Support for line-by-line operations on large files

GoogleCodeExporter commented 8 years ago

I often need to make line-by-line modifications to large files (1,000,000 lines 
or more), but pyp is not currently suited to this as it reads in the complete 
input before producing output. Would it be possible to add a mode for pyp that 
produces line-by-line output without loading the entire file at once? Obviously 
this would prohibit some operations, such as those involving 'pp', but much 
useful functionality would remain.

Original issue reported on code.google.com by neatn...@gmail.com on 4 Mar 2012 at 2:42
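
For reference, the kind of streaming mode requested here can be sketched in a few lines of plain Python. This is only an illustration of the idea, not pyp's implementation: `stream_lines` and `first_and_third_word` are hypothetical names, and the transform is an arbitrary example.

```python
from typing import Callable, Iterable, Iterator, Optional


def stream_lines(
    lines: Iterable[str],
    transform: Callable[[str], Optional[str]],
) -> Iterator[str]:
    """Yield transformed lines one at a time; a None result drops the line."""
    for raw in lines:
        # Only the current line is held in memory, never the whole input.
        result = transform(raw.rstrip("\n"))
        if result is not None:
            yield result


def first_and_third_word(line: str) -> Optional[str]:
    """Example transform (hypothetical): keep the 1st and 3rd words."""
    words = line.split()
    if len(words) < 3:
        return None
    return f"{words[0]} {words[2]}"
```

Passing `sys.stdin` as `lines` and printing each yielded value gives output as soon as each input line arrives. Operations that need the whole input, such as sorting, cannot be expressed this way, which is the trade-off mentioned in the request.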

GoogleCodeExporter commented 8 years ago

That's a good point. I looked into this when I started, but it had a number of 
drawbacks. I'll revisit when I get a chance... If I can find a straightforward 
way of doing this, maybe with a "--turbo" flag, I'll incorporate it. I'm 
definitely open to input from the open source community regarding this as well.

Original comment by tobyro...@gmail.com on 12 Mar 2012 at 10:52

GoogleCodeExporter commented 8 years ago

Hi, please try this beta and let me know how it goes...use --quick_output for 
large files.

Original comment by tobyro...@gmail.com on 15 Mar 2012 at 11:21

Attachments:

pyp_beta_2.11.1

GoogleCodeExporter commented 8 years ago

I've tested it with commands like

cat big.file | pyp --quick_output "t[0]+'\t'+t[1]+'\t\t'+t[2]"

and it works as expected. Thanks!

Original comment by neatn...@gmail.com on 17 Mar 2012 at 3:56

GoogleCodeExporter commented 8 years ago

I found a couple of cases for which passing -q leads to some missing lines of 
output:

$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp -q "rel(r'^\d [23]')"
1 1

$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp "(int(w[1]) not in {2,3})"
1 1
4 4
5 5

$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp -q "(int(w[1]) not in {2,3})"
1 1

$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp "rel(r'^\d [23]')"
1 1
4 4
5 5

Original comment by neatn...@gmail.com on 3 Apr 2012 at 8:23

GoogleCodeExporter commented 8 years ago

OK, thanks for the update on the beta. I'll look into this.

Original comment by tobyro...@gmail.com on 3 Apr 2012 at 10:04

GoogleCodeExporter commented 8 years ago

I *think* the issue is that Pyp.n is not being incremented, so safe_eval() does 
nothing after the first line that evaluates to False. But I don't understand 
the code well enough to know how to fix it.

Original comment by neatn...@gmail.com on 4 Apr 2012 at 3:52
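
The hypothesis above can be illustrated with a toy model. This is not pyp's code: the function names and the way the counter is checked are assumptions made purely to show how a counter that only advances on kept lines would produce the truncated output reported earlier.

```python
from typing import Callable, Iterable, List


def second_word_not_2_or_3(line: str) -> bool:
    """Same test as the failing command: int(w[1]) not in {2, 3}."""
    return int(line.split()[1]) not in {2, 3}


def buggy_filter(lines: Iterable[str], keep: Callable[[str], bool]) -> List[str]:
    """Toy filter whose counter n only advances when a line is kept."""
    n = 0  # hypothetical stand-in for the counter mentioned above
    kept = []
    for index, line in enumerate(lines):
        if index != n:
            # The counter is stale, so every later line is silently skipped.
            continue
        if keep(line):
            kept.append(line)
            n += 1  # bug: never reached when keep(line) is False
    return kept


def fixed_filter(lines: Iterable[str], keep: Callable[[str], bool]) -> List[str]:
    """Same loop, but the counter advances for every line."""
    n = 0
    kept = []
    for index, line in enumerate(lines):
        if index == n and keep(line):
            kept.append(line)
        n += 1  # advance whether or not the line was kept
    return kept
```

With the five input lines "1 1" to "5 5", `buggy_filter` returns only "1 1", matching the -q output above, while `fixed_filter` returns "1 1", "4 4" and "5 5".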

GoogleCodeExporter commented 8 years ago

OK, please try this one. Curly brackets don't work for me, but the command is 
essentially the same. Let me know if it works. This is the latest beta that 
should also deal with unintentional stripping. It's still beta though, so 
please let me know if you see any weirdness.

for i in 1 2 3 4 5; do echo "$i $i"; done | pyp_beta_2.11.5.py -q "(int(w[1]) not in [2,3])"
1 1
4 4
5 5

Original comment by tobyro...@gmail.com on 14 Apr 2012 at 8:19

Attachments:

pyp_beta_2.11.5.py

GoogleCodeExporter commented 8 years ago

The new pyp_beta should fix this:
http://code.google.com/p/pyp/downloads/detail?name=pyp_beta&can=2&q=#makechanges

Original comment by tobyro...@gmail.com on 16 May 2012 at 9:37

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

It seems that the new pyp_beta (2.11.23) does not include the --quick_output 
feature, right?

Original comment by apte...@gmail.com on 20 May 2012 at 3:13

GoogleCodeExporter commented 8 years ago

Quick output mode is now on by default unless using one of the list 
operators (pp, spp, fpp), so we removed the flag. Cheers, t

Original comment by tobyro...@gmail.com on 20 May 2012 at 5:07
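
One way such a default could be decided is sketched below. This is an assumption for illustration only, not how pyp makes the choice: `needs_full_input` is a hypothetical helper, and the only detail taken from the thread is the list of operator names (pp, spp, fpp).

```python
import re

# Operator names taken from the comment above; everything else is hypothetical.
LIST_OPERATORS = ("pp", "spp", "fpp")
_LIST_OP_RE = re.compile(r"\b(?:%s)\b" % "|".join(LIST_OPERATORS))


def needs_full_input(command: str) -> bool:
    """Return True if the command mentions a whole-input list operator."""
    # \b keeps 'pp' from matching inside longer names such as 'app'.
    return _LIST_OP_RE.search(command) is not None


def choose_mode(command: str) -> str:
    """Pick 'buffered' when the whole input is needed, else 'streaming'."""
    return "buffered" if needs_full_input(command) else "streaming"
```

A plain text search like this would also trigger on an operator name that appears inside a string literal in the command, so a real switch would probably need to inspect the parsed expression instead.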

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

That's weird. You should see immediate output without the redirection. Is that 
your exact command? Make sure you are running pyp_beta.  Let me know if the 
older version with the flag runs faster. 
Thanks, 

T

Original comment by tobyro...@gmail.com on 20 May 2012 at 6:31

GoogleCodeExporter commented 8 years ago

Hi, are you still seeing these issues with pyp_beta?

Thanks, t

Original comment by tobyro...@gmail.com on 22 May 2012 at 1:19

GoogleCodeExporter commented 8 years ago

Sorry for the late response. I tested pyp_beta_2.11.1 and 2.11.5 using the 
--quick_output option, and both worked on a large file without first loading 
the whole file.

Original comment by apte...@gmail.com on 22 May 2012 at 2:51

GoogleCodeExporter commented 8 years ago

Ok, thanks for checking that out. We'll look into this...that's a key feature.

t

Original comment by tobyro...@gmail.com on 22 May 2012 at 3:45

GoogleCodeExporter commented 8 years ago

Hi, I think I've found the problem...could you try this version and let me know 
how it goes?  

Thanks again for your help, it's a good suggestion, and I think it feels more 
responsive when running simple commands.  It's got a fairly complex switching 
routine, so it's taking a while to iron out the bugs.

t

Original comment by tobyro...@gmail.com on 2 Jun 2012 at 4:47

Attachments:

pyp_beta

GoogleCodeExporter commented 8 years ago

With the new pyp_beta, I can get output without loading the whole file into 
memory. However, it seems that pyp_beta is quite slow when processing large 
files.

I compared the performance of awk and pyp using the following simple example. 
The file I used (article_categories_en.nt, around 2 GB) was downloaded from 
DBpedia and contains about ten million lines.

 /usr/bin/time -o awk.time cat article_categories_en.nt | awk '{print $1,$3}' > test.awk

 /usr/bin/time -o pyp.time cat article_categories_en.nt | ./pyp_beta  'w[1],w[3]' > test.pyp

I am not sure whether I am doing this right (I am new to both pyp and awk). 
Using the above commands, awk takes around 13 s to produce the output file, 
which is around 1.5 GB. With pyp_beta, half an hour has passed and it is still 
running, having produced only about 30 MB of output.

Though this case is too simple to show the true power of pyp, the performance 
issue is really annoying.

Original comment by apte...@gmail.com on 2 Jun 2012 at 7:22
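
A note on the measurement: in the two commands above, /usr/bin/time is applied to `cat` only, the first stage of the pipeline, so the CPU figures written to awk.time and pyp.time do not include awk or pyp themselves. A minimal sketch for timing a whole pipeline from Python follows; `time_pipeline` is a hypothetical helper, and it assumes a POSIX shell is available.

```python
import subprocess
import time
from typing import Tuple


def time_pipeline(command: str) -> Tuple[float, int]:
    """Run a shell pipeline and return (wall-clock seconds, exit status)."""
    start = time.perf_counter()
    # shell=True lets the shell set up the pipes, so every stage is included.
    completed = subprocess.run(command, shell=True, check=False)
    elapsed = time.perf_counter() - start
    return elapsed, completed.returncode
```

For example, `time_pipeline("cat article_categories_en.nt | awk '{print $1,$3}' > test.awk")` would cover both `cat` and awk. This reports wall-clock time only, not the user and system CPU split that /usr/bin/time provides.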

GoogleCodeExporter commented 8 years ago

I assume you see this level of performance with the earlier pyp betas as well. 
Unfortunately, I think we'll have to go to fully compiled code to get the level 
of performance you need. Most users of pyp are working on much smaller data 
sets. Thanks for your help testing... hopefully we will get this compiled at 
some point.

Original comment by tobyro...@gmail.com on 6 Jun 2012 at 6:05
