TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project

output new datacheck entries #155

Open yakra opened 5 years ago

yakra commented 5 years ago

in https://github.com/TravelMapping/DataProcessing/issues/57#issuecomment-449699335, @michihdeu wrote:

btw: Would it be possible to output in Bash all new data errors caused by my latest changes?

Yes.

We could compare the new datacheck.log to the previous one but it's just my file. The old version is from my last run and all other new data errors caused by anyone else would be output too. But if there would be a way to get the last version from the "official" datacheck.log.....

fpcull can be used for this purpose* right now. It wouldn't output the new data to Bash though -- just write it to a specified file. The official datacheck.log is on noreaster at /home/www/tm/logs/datacheck.log.

While adding functionality to siteupdate.py would be easy, I'm a bit iffy on the "right" way to do it. The official datacheck.log is at the path noted above, not at logfilepath or anything specified via the siteupdate.py commandline. Hard-coding a path into siteupdate.py that may or may not exist on a given system seems Bad Form -- and I don't want to add a new argument just for this.

Perhaps the simplest thing to do is to execute fpcull from...


*Not the specific purpose it was designed for, but it'll work. It can be used for any case where you want to remove lines of text contained in one file from another file.

michihdeu commented 5 years ago

Thanks. I think I got it. Comparing /home/<user>/DataProcessing/siteupdate/python-teresco/logs/datacheck.log to /home/www/tm/logs/datacheck.log. The first directory is relative to datacheck.sh and the 2nd one would need an absolute hard-coded path. I don't see any issue as long as you catch an error when /home/www/tm/logs/datacheck.log would not exist. Or add an option, e.g. -c (compare) to datacheck.sh so that the comparison is not done by default, e.g. if one sets up his own server. I never enter the three lines to Bash but only copy them. It would be no additional step for me. Nothing I could forget.

yakra commented 5 years ago

Why did I go straight to fpcull when a simple DIFF will do? :P

yakra commented 5 years ago

If you're in /home/michih/DataProcessing/siteupdate/python-teresco when you execute datacheck.sh, you can output the diff to Bash with:

diff /home/www/tm/logs/datacheck.log logs/datacheck.log

If you want to save the diff as a file:

diff /home/www/tm/logs/datacheck.log logs/datacheck.log | tee newdatacheckentries.diff

or somesuch.

michihdeu commented 5 years ago

The first doesn't work: No such file or directory

[michih@noreaster ~]$ diff /home/www/tm/logs/datacheck.log logs/datacheck.log
diff: logs/datacheck.log: No such file or directory

The second outputs strange stuff but not what I need - the content of the file - and Permission denied to "my own" file:


[michih@noreaster ~]$ diff /home/www/tm/logs/datacheck.log | /home/michih/DataProcessing/siteupdate/python-teresco/logs/datacheck.log
-bash: /home/michih/DataProcessing/siteupdate/python-teresco/logs/datacheck.log: Permission denied
usage: diff [-aBbdilpTtw] [-c | -e | -f | -n | -q | -u] [--ignore-case]
            [--no-ignore-case] [--normal] [--strip-trailing-cr] [--tabsize]
            [-I pattern] [-L label] file1 file2
       diff [-aBbdilpTtw] [-I pattern] [-L label] [--ignore-case]
            [--no-ignore-case] [--normal] [--strip-trailing-cr] [--tabsize]
            -C number file1 file2
       diff [-aBbdiltw] [-I pattern] [--ignore-case] [--no-ignore-case]
            [--normal] [--strip-trailing-cr] [--tabsize] -D string file1 file2
       diff [-aBbdilpTtw] [-I pattern] [-L label] [--ignore-case]
            [--no-ignore-case] [--normal] [--tabsize] [--strip-trailing-cr]
            -U number file1 file2
       diff [-aBbdilNPprsTtw] [-c | -e | -f | -n | -q | -u] [--ignore-case]
            [--no-ignore-case] [--normal] [--tabsize] [-I pattern] [-L label]
            [-S name] [-X file] [-x pattern] dir1 dir2

I've deleted the first "ab" lines of datacheck.log to have a diff.

yakra commented 5 years ago

The first doesn't work: No such file or directory

My bad, my assumption above was incorrect; I see that you're in [michih@noreaster ~], and not /home/michih/DataProcessing/siteupdate/python-teresco.

The second outputs strange stuff but not what I need - the content of the file - and Permission denied to "my own" file:

[michih@noreaster ~]$ diff /home/www/tm/logs/datacheck.log | /home/michih/DataProcessing/siteupdate/python-teresco/logs/datacheck.log

The problem here is the | character -- you're in effect trying to take a diff of a single file, and pipe the output of diff (which is itself receiving invalid input) to /home/michih/DataProcessing/siteupdate/python-teresco/logs/datacheck.log, which is not an executable.


I guess using absolute hard-coded paths would be more foolproof:

diff /home/www/tm/logs/datacheck.log /home/michih/DataProcessing/siteupdate/python-teresco/logs/datacheck.log

or to save the diff as a file:

diff /home/www/tm/logs/datacheck.log /home/michih/DataProcessing/siteupdate/python-teresco/logs/datacheck.log | tee newdatacheckentries.diff

etc.

michihdeu commented 5 years ago

It works 😃

[michih@noreaster ~]$ diff /home/www/tm/logs/datacheck.log /home/michih/DataProcessing/siteupdate/python-teresco/logs/datacheck.log
1c1
< Log file created at: 2018-12-27 08:41:07.235800
---
> Log file created at: 2018-12-28 06:55:14.700131
5,9d4
< ab.ab501;AB41;AB/SK;;VISIBLE_DISTANCE;11.61
< ab.ab501;RR51;AB41;;VISIBLE_DISTANCE;15.09
< ab.ab509;AB511;AB3;;VISIBLE_DISTANCE;28.57
< ab.ab511;RR253;AB509;;VISIBLE_DISTANCE;15.54
< ab.ab579;AB40;RR64A;;VISIBLE_DISTANCE;11.94
618,621d612
< deunw.l182wes;PetHenStr;K31;L184;SHARP_ANGLE;151.71
< deunw.l364;L47_S;L225_E;L225_W;SHARP_ANGLE;146.42
< deunw.l531sie;L531;;;LABEL_SELFREF;
< deunw.l703;L703;;;LABEL_SELFREF;

I think I'll always copy these lines to Bash after every data check now. It would be great to get it automatically after the "Data check successful" output one day.

Is there a chance to add a "NEW" or "DEL" for each line to indicate whether the line is new or deleted (and I should remove it from FPs)?

yakra commented 5 years ago

It would be great to get it automatically after the "Data check successful" output one day.

This would be easy to add to datacheck.sh, etc. Ping @jteresco ?

Is there a chance to add a "NEW" or "DEL" for each line to indicate whether the line is new or deleted

(and I should remove it from FPs)

FPs are excluded from datacheck.log, and thus won't appear in the diff.

michihdeu commented 5 years ago

">"/"<": Thanks, I had only ">" but I see it now 👍 New unmatched FPs diff would be great too.

jteresco commented 5 years ago

No objection to automatic diffs, but we'll just have to make sure files exist before trying a diff on them so the script won't crash, say, on a first run in a given account.
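A minimal sketch of the existence check described above, not code from datacheck.sh. The function name show_new_entries is hypothetical; the two arguments stand for the official log and the freshly generated one.

```shell
#!/usr/bin/env bash
# Sketch only: show_new_entries is a hypothetical helper, not part of datacheck.sh.
# Usage: show_new_entries OFFICIAL_LOG MY_LOG
show_new_entries() {
  local official="$1" mine="$2"
  local f
  for f in "$official" "$mine"; do
    if [ ! -r "$f" ]; then
      # A missing log (e.g. first run in a new account) is reported, not fatal.
      echo "Skipping diff: $f not found or not readable" >&2
      return 0
    fi
  done
  # diff exits 1 when the files differ, which is the expected case here,
  # so the pipeline's status is masked with "|| true".
  diff "$official" "$mine" | grep '^>' | sed 's/^> //' || true
}
```

On noreaster this could be called as show_new_entries /home/www/tm/logs/datacheck.log logs/datacheck.log; on a machine without the official log it prints a note to stderr and carries on.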

yakra commented 5 years ago

My personal perspective: I'm not sure how much I'd use this, and am content to run diff manually when I want that info. I don't really use datacheck.sh; I just run siteupdate.py itself with whatever arguments I want/need at the time. I'll leave this to those who'd find it more useful, and/or know more about shell scripts than I do.

yakra commented 4 years ago

@michihdeu wrote:

It would be great to get it automatically after the "Data check successful" output one day.

Easy!

@jteresco wrote:

No objection to automatic diffs, but we'll just have to make sure files exist before trying a diff on them so the script won't crash, say, on a first run in a given account.

Having datacheck.sh diff $logdir/datacheck.log should not be a problem, as it will have just been created by the script. For those running datacheck.sh on noreaster, /home/www/tm/logs/datacheck.log will work fine. But for those running on another machine or home system, not so much.

A more foolproof solution, that can be added to the end of datacheck.sh:

echo -e "\nNew datacheck entries:"
diff <(curl -s http://travelmapping.net/logs/datacheck.log) $logdir/datacheck.log | grep '^>' | sed 's~^> ~~'

Note though, that this only lists the newly added entries.

@michihdeu wrote:

Is there a chance to add a "NEW" or "DEL" for each line to indicate whether the line is new or deleted (and I should remove it from FPs)

< deunw.l182wes;PetHenStr;K31;L184;SHARP_ANGLE;151.71
< deunw.l364;L47_S;L225_E;L225_W;SHARP_ANGLE;146.42
< deunw.l531sie;L531;;;LABEL_SELFREF;
< deunw.l703;L703;;;LABEL_SELFREF;

I see the utility in this, too. List the deleted entries, to make sure that what's supposed to be deleted is deleted, and if it's ready to be removed from FPs.

IMO the most readable way to organize the old & new datachecks is all together, rather than in a line-by-line diff...

Removed datacheck entries:
deunw.l182wes;PetHenStr;K31;L184;SHARP_ANGLE;151.71
deunw.l364;L47_S;L225_E;L225_W;SHARP_ANGLE;146.42
deunw.l531sie;L531;;;LABEL_SELFREF;

New datacheck entries:
ab.ab501;AB41;AB/SK;;VISIBLE_DISTANCE;11.61
ab.ab501;RR51;AB41;;VISIBLE_DISTANCE;15.09
ab.ab509;AB511;AB3;;VISIBLE_DISTANCE;28.57

(If we don't want to bother with VISIBLE_DISTANCE, we could even add an option to filter them out, by tacking a | grep -v VISIBLE_DISTANCE onto the end of the diff command...)

We could either curl the canonical datacheck.log from the web twice, or wget it once & save to a temporary file, whatever's clever.
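One way the grouped output could be produced, as a rough sketch rather than a finished addition to datacheck.sh. The function name report_datacheck_changes and its optional third argument are made up for illustration. The "Log file created at:" line is dropped because, going by the sample diff earlier in the thread, it differs on every run.

```shell
#!/usr/bin/env bash
# Sketch only: report_datacheck_changes is a hypothetical helper.
# Usage: report_datacheck_changes OLD_LOG NEW_LOG [PATTERN_TO_HIDE]
report_datacheck_changes() {
  local old="$1" new="$2" hide="${3:-}"
  local changes
  # Run diff once and keep only the "< ..." and "> ..." lines.
  # The timestamp header always differs, so it is filtered out.
  changes=$(diff "$old" "$new" | grep '^[<>] ' | grep -v 'Log file created at:' || true)
  if [ -n "$hide" ]; then
    # Optional filter, e.g. VISIBLE_DISTANCE.
    changes=$(printf '%s\n' "$changes" | grep -v -- "$hide" || true)
  fi
  echo "Removed datacheck entries:"
  printf '%s\n' "$changes" | sed -n 's/^< //p'
  echo
  echo "New datacheck entries:"
  printf '%s\n' "$changes" | sed -n 's/^> //p'
}

# Usage sketch (needs network access; the URL is the one quoted in this thread):
#   official=$(mktemp)
#   curl -s http://travelmapping.net/logs/datacheck.log -o "$official"
#   report_datacheck_changes "$official" "$logdir/datacheck.log" VISIBLE_DISTANCE
#   rm -f "$official"
```

Downloading to a temporary file means the canonical log is fetched once, and both lists come from the same diff.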

@michihdeu wrote:

New unmatched FPs diff would be great too.

This can also be done similarly.


A caveat:

This all assumes that your individual branch of the repo, which gets pulled down at the start of the process (https://github.com/TravelMapping/DataProcessing/blob/6e98c786abf729bf195473d9a9a4d0171f2e8c82/siteupdate/python-teresco/datacheck.sh#L30), is up to date with the newest changes in TravelMapping:master.

If not?

Not much to say about that, other than... Best practice if you wanna run siteupdate.sh is to make sure that your personal repo is up-to-date with the latest changes from TravelMapping:master.

yakra commented 4 years ago

^ Any thoughts on including

echo -e "\nNew datacheck entries:"
diff <(curl -s http://travelmapping.net/logs/datacheck.log) $logdir/datacheck.log | grep '^>' | sed 's~^> ~~'

at the end of datacheck.sh?

michihdeu commented 4 years ago

go for it!

yakra commented 4 years ago

Problem is though, if your repo isn't up-to-date with changes from TravelMapping:master, you'll see other users' datacheck errors that have been fixed in TravelMapping:master

Ping @jteresco, thoughts?

michihdeu commented 4 years ago

Mine is always up-to-date with master because it needs to be!

I didn't update my user repository from master yesterday. I only committed my update user list file. On syncing, I got an error (debug with commands...). After updating, it worked. I have this issue quite often when I don't update although there is no reason at all - I only update michih.list which is never updated by anyone else on master.

If I had to delete my user repository fork, set it up again.... no big deal, just doing. If I had a similar issue with hwy data and I had to delete my user repository fork.... Not worth to risk!

yakra commented 4 years ago

@michihdeu, I'm having trouble following your post...

I didn't update my user repository from master yesterday. I only committed my update user list file. On syncing, I got an error (debug with commands...). After updating, it worked. I have this issue quite often when I don't update although there is no reason at all - I only update michih.list which is never updated by anyone else on master.

I'm unsure what was happening here (Did you have trouble with https://github.com/TravelMapping/UserData/commit/b68a6a297d4e321741a0cadd5b38b4487e923576, https://github.com/TravelMapping/UserData/commit/2c2595534acfb00adfc941865c1da2bffef802a7 or https://github.com/TravelMapping/UserData/commit/2bf50afffafd81832f28990cc36488b0a22408f3?), but some months back I was experiencing errors about merge conflicts for unchanged files. Upgrading my git version cured this; maybe upgrading git could help you too?

There's always potential for merge conflicts with HighwayData updates -- either when merging master into our branches, or merging our branches into master. (At least in the latter case, those who are less comfortable with git can leave the conflict resolution to Jim or one of us or whoever.)

What I personally do to avoid the risk of having to delete and create my fork of the repo is, I keep my yakra:master branch clean, and only use it for merging in the latest changes from TravelMapping:master. I do my work in other branches, and make my pull requests from those. This way, if another branch gets FUBAR, I'll still have yakra:master to fall back on.

Most of this requires at least some level of comfort with git somewhere along the line...


Unless, Here's an idea... This could help make the results more foolproof & noise-free for those (like me, actually) who don't sync their repos to master all that often. Look for an optional myregions.cfg or tmregions.cfg or whatever we call it.

michihdeu commented 4 years ago

I'm unsure what was happening here

Thanks for asking, I finally got it while answering 😄

It was this pull request. I use Github Desktop on my PC and this was pressing the Update from TravelMapping/master button. I don't know what's happening backend when pressing the button.

I've updated my user list file five days later(!) - assuming about 5 updates to master meanwhile - and committed. Then, I've pressed the Sync button which updates my fork. I don't know if anything else happens backend.

Nevertheless, I got a sync error.

I had to press the Update from TravelMapping/master button again and the sync worked afterwards. I could run datacheck;

cd ~/DataProcessing/siteupdate/python-teresco
git pull
sh datacheck.sh

to make sure that my user list file update was correct (no error in michih.log).

This procedure - for user repo - did sometimes work but I usually press the Update from TravelMapping/master button to avoid any trouble like the one described here.


What happened?

I've executed datacheck in-between. For this pull request to highway data, this and this. And the git pull command updates my fork(!!). That means, my Sync would undo the changes to other user list files merged into master meanwhile.

Thanks for asking 👍 😄 😄

It's all off topic though 😆

michihdeu commented 4 years ago

Nope. git pull does not update User nor Hwy data of my noreaster folder from Travelmapping/master 😞

yakra commented 4 years ago

Nope. git pull does not update User nor Hwy data of my noreaster folder from Travelmapping/master 😞

I've not yet read your post above that one, but... I nosed around your home directory on noreaster. It looks like your repos there are both clones of michihdeu:master rather than TravelMapping:master:

[yakra@noreaster /home/michih/UserData]$ git branch
* master
[yakra@noreaster /home/michih/UserData]$ cat .git/refs/heads/master
ac421bcc2472c05308926e248297f37de4a6b5c6

That's the most recent commit in michihdeu:master right now. It was merged into TravelMapping:master in a more recent commit.

[yakra@noreaster /home/michih/HighwayData]$ git branch
* master
[yakra@noreaster /home/michih/HighwayData]$ cat .git/refs/heads/master
de7f837beb532be3c00bdc26a9bc5730618e6f08

That's the most recent commit in michihdeu:master right now, which is ahead of TravelMapping:master.

You would have to merge tm:master into your own fork, then push that commit back to GitHub, however that process normally works with GitHub Desktop (Sync button?).

michihdeu commented 4 years ago

Thanks. I don't get why my master is one commit ahead. I merged in three commits today the very same way. I don't think that @jteresco must merge anything additionally?

yakra commented 4 years ago

https://github.com/michihdeu/HighwayData/ says "This branch is 2 commits ahead of TravelMapping:master", but I think that's wrong: https://github.com/michihdeu/HighwayData/commits only shows one commit, Merge remote-tracking branch 'refs/remotes/TravelMapping/master'

So your branch is ahead by one commit, one merge commit -- but all the highway data the repo contains, the WPTs & everything else, are identical. This merge commit is not in TravelMapping:master yet, but will be when your next pull request is merged.

michihdeu commented 4 years ago

The one commit contains the data from panda's pull request I merged into travelmapping:master and then sync my fork..... It's not important to understand it. Don't waste your time on it....

yakra commented 4 years ago

^^^^^^ Weird. I still don't get it. I almost thought for a second it had something to do with there being a more recent commit on your branch than other commits on tm:master -- like you said,

I've updated my user list file five days later(!) - assuming about 5 updates to master meanwhile

...But this shouldn't matter, if you're not changing anyone else's files and nobody else is changing your files. Git will (should?) be able to happily merge together all changes as long as there are no edits to the same line of the same file. In fact, when pushing new edits to tm:master, your fork will always be ahead: If you merge tm:master into your branch first, there will be a merge commit, like described above.

Nevertheless, I got a sync error.

Remember anything about what the error message said?

What happened?

I've executed datacheck in-between. For this pull request to highway data, this and this. And the git pull command updates my fork(!!).

git pull updates your fork/clone on noreaster. It pulls down the latest changes that are on GitHub. It won't change any of the data on GitHub itself.

That means, my Sync would undo the changes to other user list files merged into master meanwhile.

Not quite, it's just that their changes are not in your repo yet -- they simply haven't been made in your fork. The changes will show up in your fork once you merge tm:master in.

So I truly don't understand what was happening to you. Not knowing anything about the error message, I'll just chalk it up to bugs in git reporting merge conflicts when there were in fact none; I've seen this behavior before.

yakra commented 4 years ago

It's not important to understand it. Don't waste your time on it....

😆

michihdeu commented 4 years ago

Remember anything about what the error message said?

nope. I was totally pissed off because I thought that I can risk to commit and sync without update from travelmapping.master first. The problem is, that the button is disabled. I need to wait about 5 minutes after starting the Git application. Then, it has recognized that there is newer data and I can manually update. It just takes a few seconds. The same story with each repository. And only the repo is checked which is selected. That means, I usually need to wait 5 minutes twice....

I still use an old version of the desktop application because I failed using the "new" one when I tried it first. 2.5 years ago.....

yakra commented 4 years ago

nope. I was totally pissed off because I thought that I can risk to commit and sync without update from travelmapping.master first.

Generally, you should be able to, especially with UserData where different people should not be editing the same files. Again... weird.

The problem is, that the button is disabled. I need to wait about 5 minutes after starting the Git application. Then, it has recognized that there is newer data and I can manually update. It just takes a few seconds. The same story with each repository. And only the repo is checked which is selected. That means, I usually need to wait 5 minutes twice....

Yuck. That sounds like a pain.

I still use an old version of the desktop application because I failed using the "new" one when I tried it first. 2.5 years ago.....

I don't have any personal experience with GitHub Desktop. I've been using GitKraken since March; before that, the GitHub web interface.


Anyway...

Adding diff <(curl -s http://travelmapping.net/logs/datacheck.log) $logdir/datacheck.log | grep '^>' | sed 's~^> ~~' to datacheck, only to require a myregions.cfg to make it work properly, may be more trouble than it's worth.

Running the command manually is easier than it looks. If every time I log into noreaster, I type only

cd TravelMapping/DataProcessing/siteupdate/python-teresco/
./datacheck.sh
diff <(curl -s http://travelmapping.net/logs/datacheck.log) logs/datacheck.log | grep '^>' | sed 's~^> ~~'
exit

...I can cycle back these 4 previous commands by pressing the up arrow 4 times after logging in. I won't have to retype it out, or paste in the command from this thread. Does that work well enough for your purposes?

michihdeu commented 4 years ago

I've been using GitKraken since March; before that, the GitHub web interface.

Is it easy to use? Self-explaining? Or a freek tool? 😉

Does that work well enough for your purposes?

The status quo is not perfect but... ok

yakra commented 4 years ago

Is it easy to use? Self-explaining? Or a freek tool? 😉

Easy to use; it has a pretty self-explanatory GUI. Some stuff is not supported though; for example I have to use the commandline if I want to move or rename a file. :( If you're already comfortable with GitHub Desktop, it's probably not worth the bother of getting familiar with a new program.

The status quo is not perfect but... ok

only to require a myregions.cfg to make it work properly, may be more trouble than it's worth.

I meant to expand on this more, but forgot. :( I think it might be near the edge of people's abilities, and add another layer of confusion for most contributors. First, there's how to create or get the file onto noreaster in the first place. Then, editing it if picking up regions from, or turning regions over to, another contributor. What are we gonna have people do, use emacs? One wrong keystroke and you're toast!

yakra commented 1 year ago

Rethinking this. This idea of looking for new datacheck errors is probably the wrong approach. Better to focus on filtering the data relevant to each contributor's regions, using a MyRegions file as mentioned upthread.

For contributors like @michihdeu & me who always quickly fix our errors or mark them FP, any errors visible will be new ones. :)

https://forum.travelmapping.net/index.php?topic=4553.msg29376#msg29376

  1. Make dashboard.sh a canonical part of DataProcessing, in a location TBD.
  2. Add a command line switch to specify a log directory.
  3. If no regions are specified on the command line and no MyRegions file is found, terminate after giving instructions on how to create one.
  4. Add MyRegions to .gitignore, or else just instruct the script to always look for it in $HOME or something.
  5. Add a prompt at the end of datacheck.sh: Do you want to run the dashboard script? (Pressing Q will quit the text viewer.) Y/N:
  6. Pipe the output to less.
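
The region filtering at the heart of this plan might look roughly like the sketch below. Everything specific here is an assumption: the function name filter_by_regions, a regions file holding one lowercase region prefix per line (e.g. deunw or ab), and datacheck lines starting with "<region>.<route>;" as in the samples earlier in the thread. It is not the actual dashboard.sh.

```shell
#!/usr/bin/env bash
# Sketch only: filter_by_regions is a hypothetical helper, not dashboard.sh.
# Usage: filter_by_regions REGIONS_FILE DATACHECK_LOG
# Assumed formats:
#   REGIONS_FILE  - one region prefix per line, as it appears in the log (e.g. deunw)
#   DATACHECK_LOG - lines like deunw.l703;L703;;;LABEL_SELFREF;
filter_by_regions() {
  local regions_file="$1" log="$2"
  if [ ! -r "$regions_file" ]; then
    # Mirrors step 3 of the list: explain how to create the file, then stop.
    echo "No regions file at $regions_file; create one with one region code per line." >&2
    return 1
  fi
  # First pass reads the wanted regions; second pass prints log lines whose
  # first field (the text before the first "." or ";") is a wanted region.
  awk -F'[.;]' '
    NR == FNR { if ($0 != "") wanted[tolower($0)] = 1; next }
    ($1 in wanted)
  ' "$regions_file" "$log"
}
```

Piping the result to less, as in step 6, would be filter_by_regions "$HOME/MyRegions" logs/datacheck.log | less, where the file location is again only a guess.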

michihdeu commented 1 year ago

For contributors like @michihdeu & me who always quickly fix our errors or mark them FP, any errors visible will be new ones. :)

True.