ericchiang / pup

Parsing HTML at the command line
MIT License

Can anyone help me with the following? #172

Open KaMyKaSii opened 2 years ago

KaMyKaSii commented 2 years ago

I'm no HTML expert, I just want to get a string from a site to use in a shell script. What command can I use on this page to get the string "2022-02-10 23:09:03"? Any help is appreciated. Thanks.

rjp commented 2 years ago

Looks like pup won't let you access the wire:initial-data attribute directly (which seems like a bug to me, will probably create an issue later) but you can work around that with the json{} output and jq (or other JSON processor, I guess?)

cat 45480728909.html | \
pup -p 'div[wire:initial-data] json{}' | \
jq -r '.[]|."wire:initial-data"|fromjson|.serverMemo.data.stream.stream_created_at|select(.)' | \
sort -u

Since 2022-02-10 23:09:03 only appears in the wire:initial-data attribute of various divs, we match those divs and print them as JSON (using the -p flag to convert the entities). jq then does the heavy lifting: 1) getting that attribute, 2) converting it to a real object with fromjson, 3) finding the stream_created_at key (which is the only one that matches the given date), and 4) removing the nulls from the list. Finally, sort -u condenses the result to a unique list, which in this case is just the one date.

(If you don't have sort, you can do the uniquification in jq: jq -r '[.[]|."wire:initial-data"|fromjson|.serverMemo.data.stream.stream_created_at|select(.)]|unique|.[]')
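To see what the jq filter does without needing pup or the original page, here is a rough sketch that runs it against a small hand-written sample. The sample is an assumption: it only imitates the shape that json{} produces (one object per matched div, with the attribute as a key whose value is a JSON string), and the shell variable names are made up for illustration.

```sh
#!/bin/sh
# Hypothetical stand-in for `pup -p 'div[wire:initial-data] json{}'` output.
# Each element carries the attribute as a *string* containing JSON.
sample='[
  {"tag":"div","wire:initial-data":"{\"serverMemo\":{\"data\":{\"stream\":{\"stream_created_at\":\"2022-02-10 23:09:03\"}}}}"},
  {"tag":"div","wire:initial-data":"{\"serverMemo\":{\"data\":{}}}"},
  {"tag":"div","wire:initial-data":"{\"serverMemo\":{\"data\":{\"stream\":{\"stream_created_at\":\"2022-02-10 23:09:03\"}}}}"}
]'

# Same filter as above: take the attribute, parse it with fromjson,
# walk down to stream_created_at, and drop nulls with select(.).
result=$(printf '%s\n' "$sample" |
  jq -r '.[]|."wire:initial-data"|fromjson|.serverMemo.data.stream.stream_created_at|select(.)' |
  sort -u)

# Variant that deduplicates inside jq instead of using sort -u.
result_jq=$(printf '%s\n' "$sample" |
  jq -r '[.[]|."wire:initial-data"|fromjson|.serverMemo.data.stream.stream_created_at|select(.)]|unique|.[]')

echo "$result"
echo "$result_jq"
```

The second div has no stream key, so the path lookup yields null for it and select(.) filters it out; the two remaining identical dates collapse to one.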

rjp commented 2 years ago

If PR https://github.com/ericchiang/pup/pull/175 gets merged, you can change the pup part to pup -p 'div[wire:initial-data] attr{wire:initial-data}', which retrieves the attribute value directly and simplifies the jq part that follows.

cat 45480728909.html | \
pup -p 'div[wire:initial-data] attr{wire:initial-data}' | \
jq -sr '.[]|.serverMemo.data.stream.stream_created_at|select(.)' | \
sort -u
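The jq half of this variant can be tried on its own too. The sketch below assumes that, with the PR applied, attr{} prints each attribute value on its own line as plain JSON; the sample input and variable names are invented for illustration and are not real pup output.

```sh
#!/bin/sh
# Hypothetical stand-in for `pup -p '... attr{wire:initial-data}'` output:
# one JSON object per line, no wrapping array.
attr_sample='{"serverMemo":{"data":{"stream":{"stream_created_at":"2022-02-10 23:09:03"}}}}
{"serverMemo":{"data":{}}}
{"serverMemo":{"data":{"stream":{"stream_created_at":"2022-02-10 23:09:03"}}}}'

# -s slurps the separate objects into one array, so .[] iterates them;
# no fromjson is needed because the input is already JSON, not a string.
created_at=$(printf '%s\n' "$attr_sample" |
  jq -sr '.[]|.serverMemo.data.stream.stream_created_at|select(.)' |
  sort -u)

echo "$created_at"
```

Since jq already applies a filter to every JSON value it reads, dropping both -s and the leading .[]| should give the same result here; the slurped form just keeps the filter close to the json{} version.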