EdJoPaTo / website-stalker

Track changes on websites via git
GNU Lesser General Public License v2.1

Sorting/Ordering is changed, but no change in content #189

Closed by Hans-Maulwurf 3 months ago

Hans-Maulwurf commented 11 months ago

Hey,

First, I'm happy that I found your tool; it's very useful.

I have one special case for change tracking. I want to keep track of changes on the site https://antcheck.info/species/Colobopsis_leonardi. There are cards with prices. The cards seem to be generated, so their order changes with every request. Of course website-stalker recognizes this as a change. Is there a way to reorder or sort the elements so that changes are only detected when the content really changes? At the moment I use this config:

  - url: https://antcheck.info/species/Colobopsis_leonardi
    editors:
      - css_select: .card-body
      - css_remove: img
      - regex_replace:
          pattern: "Last updated: \\d+ (hour|minute)(s)? ago"
          replace: ""
      - html_prettify

EdJoPaTo commented 11 months ago

I thought about something like this but didn't continue with that thought as I had no important use case.

An idea I had was something like this:

editors:
  - css_sort: .card-body
  # or
  - css_select:
      selector: .card-body
      sort: true
  # or
  - css_select:
      selector: .card-body
      sort: .badge-pill

The second and third options would be a bit more complicated to implement but seem more natural to use.

Basically, this would need a selector to determine what should be sorted, which yields a list of items. This list can then be sorted by outerHTML or by a selector.

This would also allow adding something like unique to filter out duplicate items:

editors:
  - css_select:
      selector: .card-body
      unique: true

I haven't thought much about how to implement it. Basically, it's just an idea of how it could be used afterwards. Any thoughts on this?
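
A minimal sketch of the sort and unique ideas above, written in Python rather than as website-stalker code. It assumes well-formed markup so that the standard-library XML parser can stand in for a real HTML parser, and it matches elements by class name instead of a full CSS selector. The function names (`select_by_class`, `outer_html`, `select_sorted`) are hypothetical, and sorting by the text of the sub-element is only one possible reading of `sort: .badge-pill`.

```python
import xml.etree.ElementTree as ET


def select_by_class(root, class_name):
    """Return elements (root included) whose class attribute contains class_name."""
    return [el for el in root.iter() if class_name in el.get("class", "").split()]


def outer_html(el):
    """Serialize an element without the text that follows its closing tag."""
    tail, el.tail = el.tail, None
    try:
        return ET.tostring(el, encoding="unicode")
    finally:
        el.tail = tail


def select_sorted(document, class_name, sort_class=None, unique=False):
    """Select elements by class and return their markup in a stable order.

    sort_class=None sorts by the full markup (the `sort: true` idea);
    otherwise the text of the first matching sub-element is the sort key.
    """
    root = ET.fromstring(document)
    items = select_by_class(root, class_name)

    if sort_class is None:
        key = outer_html
    else:
        def key(el):
            matches = select_by_class(el, sort_class)
            return "".join(matches[0].itertext()).strip() if matches else ""

    fragments = [outer_html(el) for el in sorted(items, key=key)]
    if unique:
        # dict.fromkeys keeps the first occurrence and preserves order
        fragments = list(dict.fromkeys(fragments))
    return fragments
```

With something like this, two fetches that return the same cards in a different order produce the same list, so a diff would only show real content changes.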

Hans-Maulwurf commented 11 months ago

Well, something like that would help, I guess ;)

But on the other hand, I found that this site has a separate API: https://antcheck.info/api/docs

The response is JSON, so I use json_prettify. But it contains "id" values that change from time to time, so a change is again detected when there is no real change in content ;) If there were an editor like json_remove (similar to css_remove), I could use the API.
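
To illustrate what a json_remove step could do, here is a small Python sketch. It is not an existing website-stalker editor; the function names are hypothetical, and the field names other than "id" in the usage note are made up for the example. It drops the given keys at every nesting level and then pretty-prints with sorted keys so the output is stable.

```python
import json


def remove_keys(value, keys):
    """Recursively drop the given keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: remove_keys(v, keys) for k, v in value.items() if k not in keys}
    if isinstance(value, list):
        return [remove_keys(item, keys) for item in value]
    return value


def json_remove_and_prettify(text, keys):
    """Parse JSON text, remove volatile keys, and return stable pretty output."""
    data = json.loads(text)
    cleaned = remove_keys(data, set(keys))
    return json.dumps(cleaned, indent=2, sort_keys=True, ensure_ascii=False)
```

Two responses such as `{"offers": [{"id": 101, "shop": "A"}]}` and `{"offers": [{"id": 202, "shop": "A"}]}` then produce identical output when called with `["id"]`.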

Btw, is there a way to get more details about an error? I often get ERROR ... expected value at line 1 column 1

but I can't see what went wrong. Maybe it's because of rate limiting or something? The seconds between calls to the same URL (with different parameters) are not configurable, are they?

EdJoPaTo commented 11 months ago

Currently, the delay between calls to the same domain is 5 seconds, which is not configurable. Not sure whether it would be interesting to make that configurable?
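
For illustration, a fixed per-domain delay like the one described here could be tracked as sketched below. This is not how website-stalker implements it; the class name `DomainThrottle` and its method are hypothetical, and the clock value is passed in so the logic can be checked without actually sleeping.

```python
from urllib.parse import urlsplit


class DomainThrottle:
    """Track when the next request to each host is allowed."""

    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Return how long to wait before requesting url, and reserve the slot."""
        host = urlsplit(url).hostname or ""
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.delay
        return start - now
```

In real use, `now` would come from `time.monotonic()` and the returned value would go to `time.sleep()`. Because the key is the host, URLs that differ only in path or query parameters share the same delay.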

When there is an HTTP error, it would already fail with that. So the request seems to return a successful but empty response which cannot be parsed as JSON? There is currently no more information than the error printout. I am not sure what could be done better there. Maybe print which editor is failing?

json_select / json_remove is the goal of #77, but it was never really needed, so I have not approached it yet. Looks like you are the first one who is interested in it.
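
One way to get more out of such a failure is to look at what the body actually contains when parsing fails. The Python sketch below is only an illustration (the function name `describe_json_failure` is hypothetical and Python's wording differs from the Rust error above). As far as I know, serde_json reports a completely empty input as an EOF error, so "expected value at line 1 column 1" may rather point at a body whose first character is not JSON, for example an HTML page returned when rate limited; that is a guess, not something confirmed in this thread.

```python
import json


def describe_json_failure(body, preview_chars=60):
    """Return None if body parses as JSON, otherwise a short diagnostic string."""
    try:
        json.loads(body)
        return None
    except json.JSONDecodeError as err:
        if not body.strip():
            kind = "empty body"
        elif body.lstrip().startswith("<"):
            kind = "looks like HTML/XML"
        else:
            kind = "not JSON"
        preview = body[:preview_chars]
        return f"{err.msg} at line {err.lineno} column {err.colno} ({kind}): {preview!r}"
```

Printing a short preview of the body alongside the parser message makes it easier to tell an empty response from an HTML error page.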

Hans-Maulwurf commented 11 months ago

Well, I think the "best" improvement would be sorting/ordering HTML elements. If this worked, I wouldn't need the debug option or json_remove. It would have to be designed so that elements can be sorted by a sort argument that sits inside the element, in some sub-element.

Sorting/reordering with regex does not really seem possible.

Hans-Maulwurf commented 11 months ago

@EdJoPaTo do you have an idea whether you will be able to develop this feature in the near future?

EdJoPaTo commented 11 months ago

I would like this feature myself, so it’s definitely on my todo list. Not sure when exactly I will have time for it, but I would like to say sooner rather than later. Thank you for reminding me to increase its priority 😇

EdJoPaTo commented 11 months ago

> Btw, is there a way to get more details about an error? I often get ERROR ... expected value at line 1 column 1

I improved the error message on editor errors (see cea6b7375dc9791f811b0fdfa512656a95f658cd). It now looks like this:

ERROR: https://edjopato.de/post/ in editor[4] json_prettify: expected value at line 1 column 1
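
The idea behind such a message can be sketched in a few lines of Python: run the editors in order and, when one fails, wrap the error with the URL, the editor's position and its name. This is not the code from the commit above; `EditorError`, `apply_editors` and the zero-based index are assumptions made for the illustration.

```python
import json
from typing import Callable, List, Tuple

# An editor is a (name, function) pair that transforms the content string.
Editor = Tuple[str, Callable[[str], str]]


class EditorError(Exception):
    """Raised when one editor in the chain fails; the message names the editor."""


def apply_editors(url: str, content: str, editors: List[Editor]) -> str:
    """Apply editors in order, adding context to the first failure."""
    for index, (name, run) in enumerate(editors):
        try:
            content = run(content)
        except Exception as err:
            raise EditorError(f"{url} in editor[{index}] {name}: {err}") from err
    return content


def json_prettify(text: str) -> str:
    """Parse and re-serialize JSON with indentation."""
    return json.dumps(json.loads(text), indent=2)
```

The position matters when the same editor appears more than once in a chain, since the name alone would not say which step failed.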

The sorting is its own part which I haven't approached yet.

EdJoPaTo commented 11 months ago

Current working state is something like this:

- css_sort:
    selector: article
    sort_by: # here you can use every editor again which is applied to every selected html element
      - css_select: a
      - html_sanitize

In my test case I found out that I need the sanitize step because of irregular links whose attributes differ. But I mainly found this because I was able to add debug prints into website-stalker while trying to understand what was happening. This is not possible for users of website-stalker. Not sure how to deal with something like that in a useful way.
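
To make the sort_by idea more concrete, here is a rough Python sketch, not the actual implementation. It assumes well-formed markup (so the standard-library XML parser can be used), selects by tag name instead of a CSS selector, and uses hypothetical names throughout (`select_tag`, `sanitize`, `sort_keys`, `css_sort`). Each selected element is passed through the sort_by editors to compute its key; `sort_keys` returns those keys so a user could see what is actually being compared.

```python
import xml.etree.ElementTree as ET


def outer_html(el):
    """Serialize an element without the text that follows its closing tag."""
    tail, el.tail = el.tail, None
    try:
        return ET.tostring(el, encoding="unicode", short_empty_elements=False)
    finally:
        el.tail = tail


def select_tag(tag):
    """Editor factory: keep only the elements with the given tag name."""
    def run(fragment):
        root = ET.fromstring(fragment)
        return "".join(outer_html(el) for el in root.iter(tag))
    return run


def sanitize(fragment):
    """Editor: drop all attributes so irregular links compare equal."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    for el in root.iter():
        el.attrib.clear()
    text = ET.tostring(root, encoding="unicode", short_empty_elements=False)
    return text[len("<root>"):-len("</root>")]


def sort_keys(document, tag, sort_by):
    """Return (key, markup) pairs, sorted by key, for every selected element."""
    root = ET.fromstring(document)
    pairs = []
    for el in root.iter(tag):
        markup = outer_html(el)
        key = markup
        for editor in sort_by:  # every editor is applied to each element
            key = editor(key)
        pairs.append((key, markup))
    return sorted(pairs, key=lambda pair: pair[0])


def css_sort(document, tag, sort_by):
    """Return the selected elements' markup ordered by their computed keys."""
    return [markup for _, markup in sort_keys(document, tag, sort_by)]
```

Calling `sort_keys(page, "article", [select_tag("a"), sanitize])` shows the computed key next to each element, which could stand in for the debug prints mentioned above.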

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

EdJoPaTo commented 9 months ago

I am still not happy with the current approach as it’s hard to understand what’s going on…
