ericchiang / pup

Parsing HTML at the command line
MIT License
8.05k stars 251 forks

Cannot select two attributes with two "attr{}" calls on the same line #134

Open sebma opened 4 years ago

sebma commented 4 years ago

Hi, I'm using pup v0.4.0

I cannot select two different attributes with two attr{} calls in a single selector:

Selecting the title attribute of the link[type="application/x-wiki"] element:

$ curl -qs http://en.wikipedia.org/wiki/Robots_exclusion_standard | pup 'link[type="application/x-wiki"] attr{title}'
Edit this page

Selecting the rel attribute of the link[type="application/x-wiki"] element:

$ curl -qs http://en.wikipedia.org/wiki/Robots_exclusion_standard | pup 'link[type="application/x-wiki"] attr{rel}'
alternate

Selecting both attributes of the link[type="application/x-wiki"] element:

$ curl -qs http://en.wikipedia.org/wiki/Robots_exclusion_standard | pup 'link[type="application/x-wiki"] attr{title},link[type="application/x-wiki"] attr{rel}'
alternate

As you can see, pup only selected the last one.

Can you please have a look?

Lewiscowles1986 commented 3 years ago

Is this a bug or a feature?

Where was it documented that this ever worked this way?

sebma commented 3 years ago

@Lewiscowles1986 My title was a little confusing, so I changed it to:

Cannot select two attributes with two "attr{}" calls on the same line

Lewiscowles1986 commented 3 years ago

Thanks. I think that helps both you and the author.

From looking at the code, this looks like it would take an overhaul of how they do things.

For now, I think you may have more luck chaining subprocess calls. Do the network call once and store the result of curl (or any call) in something you can control, like a shell variable or a file. You can also filter that down so you are not eating the whole elephant for each call to pup.

```bash
DATA=$(curl -qs http://en.wikipedia.org/wiki/Robots_exclusion_standard)
WIKILINKS=$(echo $DATA | pup 'link[type="application/x-wiki"]')
echo $WIKILINKS | pup 'link[type="application/x-wiki"] attr{title}' > titles.txt
echo $WIKILINKS | pup 'link[type="application/x-wiki"] attr{rel}' > rels.txt
# some command to take matching line numbers and match them up
paste -d" " titles.txt rels.txt
```

You could of course pipe to pup to pre-filter and prevent duplicate processing, or submit a patch to pup to allow it to do something akin to the above, where it detects the number of args and processes the first, followed by the two follow-up selectors.

The Wikipedia link kept giving me EOF, so I don't know if they are under a lot of load right now, but I did a similar thing with GitHub:

```bash
#!/bin/bash
DATA=$(curl -qs https://github.com/ericchiang/pup/issues/134)
#echo $DATA
AUTHOR_LINK_INFO=$(echo $DATA | pup 'h3.timeline-comment-header-text a.author.link-grey-dark')
echo $AUTHOR_LINK_INFO | pup 'a attr{href}' > author-links.txt
echo $AUTHOR_LINK_INFO | pup 'a text{}' > author-names.txt
DATETIMEINFO=$(echo $DATA | pup 'a > [datetime]')
echo $DATETIMEINFO | pup '[class] attr{datetime}' > when-raw.txt
echo $DATETIMEINFO | pup '[class] text{}' > when-display.txt
# some command to take matching line numbers and match them up
paste -d" " author-names.txt author-links.txt when-raw.txt when-display.txt
```

Tested on Git Bash for Windows.

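
To make the pairing step concrete, here is a minimal sketch of what `paste` does with two such files. The file contents are hypothetical sample data standing in for real pup output (the second row is invented purely for illustration), and the `|` delimiter is an assumption chosen because values such as "Edit this page" contain spaces, which makes a space delimiter ambiguous to split afterwards.

```bash
# Work in a throwaway directory so nothing in the current one is overwritten.
tmpdir=$(mktemp -d)

# Hypothetical stand-ins for the titles.txt and rels.txt written by the pup calls.
printf '%s\n' 'Edit this page' 'Another title' > "$tmpdir/titles.txt"
printf '%s\n' 'alternate' 'stylesheet' > "$tmpdir/rels.txt"

# paste joins line N of each file into one output line; -d sets the delimiter.
combined=$(paste -d '|' "$tmpdir/titles.txt" "$tmpdir/rels.txt")
printf '%s\n' "$combined"

rm -rf "$tmpdir"
```

This only lines up correctly when each pup call emits exactly one line per matched element, so an element that lacks one of the attributes would shift every row after it.
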
rjp commented 2 years ago

Looks like you can fudge around this with the JSON output these days?

curl -qsL http://en.wikipedia.org/wiki/Robots_exclusion_standard | pup 'link[type="application/x-wiki"] json{}' | jq -r '.[]|[.title, .rel] | @tsv'
Edit this page  alternate