mih opened this issue 5 years ago
https://github.com/con/communitator was my quest for a tool that would take care of such specs. HelpMe from Vanessa was the only one suggested.
It might be worth considering also including (and of course allowing the user to edit) all relevant config files, and maybe even some structural info on the dataset the command ran in (like super-/subdatasets).
This would be very doable with helpme (sorry I didn't see this when it was posted). For GitHub, it would require a user token, which is a likely barrier in that a user wouldn't readily want to generate and provide one.
Even if we set up a post to discourse (askci or neurostars or similar) with tag / topic of "datalad" we would still require the user to accept (on first issue) the OAuth screen.
Or a third option - a custom datalad submitter that sends data somewhere else?
> For GitHub, it would require a user token
oh... that reminded me about https://github.com/vsoch/helpme/issues/44 -- IMHO no token should be necessary; I believe the full form submission could be crafted via URL. Then any user could submit a PR.
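For illustration, here is a minimal sketch of crafting such a URL (the repo, title, and body are placeholders; GitHub's new-issue form accepts `title` and `body` query parameters):

```python
import webbrowser
from urllib.parse import quote, urlencode

# Placeholder values -- repo, title, and body are illustrative only.
repo = 'datalad/datalad'
title = 'Crash report: example'
body = 'Traceback and system info would go here...'

# GitHub prefills the "new issue" form from the title/body query
# parameters, so no token is needed; the user reviews the content in
# the browser and clicks "Submit new issue" themselves.
url = f'https://github.com/{repo}/issues/new?' + urlencode(
    {'title': title, 'body': body}, quote_via=quote)
webbrowser.open(url)
```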
> Even if we set up a post to discourse (askci or neurostars or similar) with tag / topic of "datalad" we would still require the user to accept (on first issue) the OAuth screen.
yes, neurostars would be the one, and I think it is ok for a user to accept the OAuth screen if really needed (i.e. if it couldn't be done similarly by passing the whole body via URL)
@yarikoptic that's definitely doable! The url would open and then (hopefully) the user would be logged in to submit.
Would it be okay with you to add a custom datalad helper for this purpose? The official GitHub integration happens with the token (more programmatically) and I wouldn't want to fuss around with that.
Or are you more interested in submission to neurostars?
Going for a quick run, back in a bit!
Oh I just had a random cool idea (that will need a little more thinking through). Helpme is optimized so far for the user / support staff - the issue is posted to GitHub. But another alternative use case would be to build another layer that can process what helpme produces, and ultimately produce some kind of error metrics. For example, imagine if there was a repo "datalad-support" where the issues came in, but each time a new issue was posted we had a GitHub workflow to process the metadata and update some tiny (flat file based) database stored with the repository. Even if it's just a listing of the things listed above, it would be cool to see what kind of things we can learn from the data. If it works well or provides insights for datalad, I would bet other open source projects would be interested too! I could make a helpme client in other languages, if it were desired.
okay now run!
Back! The high level goal is to provide automated bug reports, but also to turn those bug reports into actionable data.
I think that the last idea is close to what I thought to "research" at some point -- an open alternative to https://sentry.io, the service to automatically report crashes etc. to. Such a service could inform us (even without "manual" bug reports) what kinds of problems users run into, and how common they are. Since I am afraid errors might be "too common" etc., I would not make it file a new issue for each occasion. There should be some "fingerprint" of a crash, and follow-up on existing ones (providing more OS etc. info) if the fingerprint matches. So it might be quite tricky to implement (unless there is already some solution). But if it is GitHub-based (the "datalad-support" repo you thought about) -- it would not be available to all users, but only to those who have a GitHub token registered, I guess. So it would be of a different use -- probably just to automagically upload gory details of an error to later link to in a "manual" bug report. But that again could be done as part of the initial idea here -- and GitHub would match by the title (at least) if another report like that was reported already, thus possibly helping to eliminate duplicates. Or have I misunderstood the idea?
You know, sentry.io has a free tier - I use it on a lot of projects and it works great. It does require, however, a token :)
You hit the right thread of what I'm getting at, and let me provide some more detail to help.
Does that make sense? It's a totally free / open source and hacky way of getting the simplest bit of sentry.io's functionality (a record of the issue), based on GitHub.
@mih could you give me an example of how you'd retrieve the traceback, and output from this wtf tool?
Hey @vsoch !
> @mih could you give me an example of how you'd retrieve the traceback, and output from this wtf tool?
Would you need to hook into another tool from your end? I thought the easiest way (and TBH the only way I can imagine right now) would be to hook into datalad's last layer of the command line interface, where all exceptions bubble up to (around here). Given the exceptions themselves, any recoverable traceback should be accessible from that point and could be fed into a reporting helper. In addition, wtf() could be utilized there to amend the report with system info.
Am I on the right path?
> output from this wtf tool?

You can find its code under `datalad/plugin/wtf.py` and tests under `datalad/plugin/tests/test_plugins.py`. Simplest invocation:
```sh
python -c 'from datalad.api import wtf; res=wtf(); print(">%s<" % wtf())'
```
This would show that we return the structure; the rendered version is printed to the screen... I guess you would need to either use `swallow_outputs` (like the tests do) or add an argument to return (yield) the rendered version instead of the structure.
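For instance, a minimal sketch of the `swallow_outputs` route (as the tests use it):

```python
from datalad.api import wtf
from datalad.utils import swallow_outputs

# Run wtf() while capturing what it prints to stdout.
with swallow_outputs() as cmo:
    wtf()
rendered = cmo.out  # the human-readable report, usable as an issue body
```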
Wow this is impressive! I'm guessing the `wtf()` function returns the content between the `>` `<` and the rest is just printed to the screen for the user.
I'll give a shot at a headless call to a Helper that will take extra input (the output above) and try to create an issue. Given the size of the content, I am thinking this will be hard to do without a proper GitHub token, but it's worth a shot.
@yarikoptic so you would want to keep the user checking / validating what is being sent first (environment for example?) Right now there is a prompt built in to check.
okay, I'm ready for posting and testing further; can we create the repo datalad-support here, giving me permission to write?
And here is an example issue - the details are generated by wtf (datalad) and the following sections are provided by helpme. https://github.com/vsoch/askci/issues/35. I'm writing up some quick docs to show you now for how that was generated.
Here are the (non-rendered) docs for how it works: https://github.com/vsoch/helpme/pull/49/files#diff-4dd658e1156e1809a673713a07f7e534R80-R113 (rendered version TBA).
> the `wtf()` function returns the content between the `>` `<` and the rest is just printed to the screen for the user.

somewhat -- between `><` I just printed what is actually returned (a dict structure, which is not rendered and thus not really for inclusion as-is into an issue intended for human consumption). That is why I thought that either we RF to provide a stream to `.write` to, or just use `swallow_outputs` to capture it.
> @yarikoptic so you would want to keep the user checking / validating what is being sent first (environment for example?) Right now there is a prompt built in to check.
if it would be just an invocation of "new issue" via URL to GitHub, the user would get all that information presented for review / editing in the browser before they hit "Submit new issue", even with the "Preview" tab - so we would not need any editing/visualization on our end ;)
> can we create the repo datalad-support here, giving me permission to write?
I created https://github.com/datalad/datalad-helpme and invited you. I decided not to go with datalad-support because we have the `datalad.support` module and are thinking about extracting some "setup-support" functionality into a common reusable package, so it might be a bit confusing. We could rename it to anything else later on if desired.
@yarikoptic I have only been testing using a token - the substantial size of the body has me guessing the unauthenticated version won't work - I'll mess around with it now to see how to encode the content for the URL.
> ... details are generated by wtf (datalad) and the following sections are provided by helpme. vsoch/askci#35.
Cool! That wasn't via URL + browser, but via token, right?
As I have mentioned above, I think the markdown rendering we have would ideally be better for human consumption than the structure; here is an example:
@yarikoptic how do I capture the markdown rendering? It's just printed to the screen.
And @yarikoptic what data would you like used to generate a hash identifier for the issue?
holy crap the non token way worked! Super cool :)
Needed:

```python
import urllib.parse

body = urllib.parse.quote(body)
```
@yarikoptic so the one remaining piece of feedback I need is what content you would like to use for the hash. Once I know that, I'll add it to the PR and then test a GitHub workflow to handle a new issue. After that we can try testing from within datalad (the code in the README for datalad-helpme would basically be put somewhere in datalad).
> @yarikoptic how do I capture the markdown rendering? It's just printed to the screen.
resolved per https://github.com/vsoch/helpme/pull/49/files#r358953557
> ... what content would you like to use for the hash?
I think the traceback would be the most reliable way. Ideally I think it should be reduced to a normalized structure, as in the example below. E.g. take a random open issue with an "exception", https://github.com/datalad/datalad/issues/2855, which lists the following traceback:
```
Traceback (most recent call last):
  File "/anaconda3/bin/datalad", line 8, in <module>
    main()
  File "/anaconda3/lib/python3.6/site-packages/datalad/cmdline/main.py", line 495, in main
    ret = cmdlineargs.func(cmdlineargs)
  File "/anaconda3/lib/python3.6/site-packages/datalad/interface/base.py", line 628, in call_from_parser
    ret = list(ret)
  File "/anaconda3/lib/python3.6/site-packages/datalad/interface/utils.py", line 422, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/anaconda3/lib/python3.6/site-packages/datalad/interface/utils.py", line 491, in _process_results
    for res in results:
  File "/anaconda3/lib/python3.6/site-packages/datalad/distribution/publish.py", line 824, in __call__
    **res_kwargs):
  File "/anaconda3/lib/python3.6/site-packages/datalad/distribution/publish.py", line 313, in _publish_dataset
    diff = True if force else has_diff(ds, refspec, remote, paths)
  File "/anaconda3/lib/python3.6/site-packages/datalad/distribution/publish.py", line 93, in has_diff
    remote_ref = '/'.join((remote, remote_branch_name))
TypeError: sequence item 1: expected str instance, NoneType found
```
From that we could get as a fingerprint (well -- catalogued under a checksum of its serialization into JSON; `str` wouldn't be good enough due to all the `'` vs `"` etc.) the following dictionary (ordered/sorted) with two keys:
```python
{
    'traceback': [
        ('datalad', 8, '<module>', 'main()'),
        ('datalad/cmdline/main.py', 495, 'main', 'ret = cmdlineargs.func(cmdlineargs)'),
        ...
        ('datalad/distribution/publish.py', 93, 'has_diff', "remote_ref = '/'.join((remote, remote_branch_name))")
    ],
    'exception': ("TypeError", "sequence item 1: expected str instance, NoneType found")
}
```
where for exception it is `(exc.__class__.__name__, str(exc))`.
But line numbers would bring pain here, since with a minor change (elsewhere) they would shift; so maybe we could instead omit them (a less strict fingerprint) and store that one as the first tier, and then the checksum of the one with line numbers as the 2nd tier (which would identify it exactly).
Makes sense? (there could probably be much better ways!)
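A minimal sketch of that two-tier idea, assuming we start from a caught exception (`fingerprint()` is a hypothetical helper, not existing datalad API, and using file basenames is a simplification of the paths shown above):

```python
import hashlib
import json
import os
import traceback

def fingerprint(exc):
    """Return (loose, exact) sha256 fingerprints for an exception."""
    frames = traceback.extract_tb(exc.__traceback__)
    exception = (exc.__class__.__name__, str(exc))
    # Tier 1 omits line numbers, so it survives unrelated code shifts.
    loose = {
        'traceback': [(os.path.basename(f.filename), f.name, f.line)
                      for f in frames],
        'exception': exception,
    }
    # Tier 2 keeps line numbers for an exact match.
    exact = {
        'traceback': [(os.path.basename(f.filename), f.lineno, f.name, f.line)
                      for f in frames],
        'exception': exception,
    }
    def digest(d):
        # JSON with sorted keys gives a stable serialization to checksum.
        return hashlib.sha256(
            json.dumps(d, sort_keys=True).encode('utf-8')).hexdigest()
    return digest(loose), digest(exact)

# Usage: fingerprint the exception at the point where it bubbles up.
try:
    '/'.join(('origin', None))  # provokes the TypeError from above
except TypeError as e:
    loose_id, exact_id = fingerprint(e)
```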
PS oh -- an idea!!! It would be cool if we managed to annotate our "DB" of such records with the version (`git describe`) where it was fixed, so whenever we check for them and find one marked as fixed -- we could report to the user something like:
```
We have identified this error as a possible duplicate of an issue #1234 (http://github.com/datalad/datalad/issues/1234) which was fixed in 0.11.2-22-ga53a87c30, so please upgrade (the most recent release is 0.11.7).
Do you still want to report a (n)ew bug report or (c)ontinue [n/c]?
```
(yet possibly introducing use of etelemetry from @satra and team here). The dialogue should be similar for issues which are not yet fixed:
```
We have identified this error as a possible duplicate of an issue #1234 (http://github.com/datalad/datalad/issues/1234).
Do you still want to report a (n)ew bug report, (a)dd to existing, or (c)ontinue [n/a/c]?
```
I'll look at the traceback ASAP! I actually just sent you a Gitter message that suggested the same thing haha. To be clear, if we are not requiring a GitHub token and using the API, we won't be able to identify if the issue exists beforehand. However, with the GitHub workflow we should be able to immediately answer the issue with a similar message after it's posted. I'm going to put together a very simple toy example to demonstrate what I have in mind this evening!
> holy crap the non token way worked! Super cool :)
AWESOME! With that in mind, I think we might even be better off not bothering with a separate repo, but rather pointing to this issue tracker!
Actually, our use case is even better! We have a number of "extensions" (e.g. https://github.com/datalad/datalad-container, https://github.com/datalad/datalad-crawler, etc.). We should "register" them within "datalad helpme" support. Depending on the traceback, we might need to ask the user which repository to file against -- "datalad", "datalad-container", ... (if anything from the extensions is in the traceback -- take the "deepest" as the one to suggest by default; see the sketch below).
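A hedged sketch of that "deepest extension in the traceback" heuristic (the registry and helper name here are illustrative, not an existing API):

```python
import traceback

# Hypothetical registry mapping extension module names to issue trackers.
EXTENSION_REPOS = {
    'datalad_container': 'datalad/datalad-container',
    'datalad_crawler': 'datalad/datalad-crawler',
}
DEFAULT_REPO = 'datalad/datalad'

def suggest_repo(exc):
    """Suggest the tracker of the deepest extension seen in the traceback."""
    repo = DEFAULT_REPO
    for frame in traceback.extract_tb(exc.__traceback__):
        for module, candidate in EXTENSION_REPOS.items():
            if module in frame.filename:
                # Later frames are deeper in the call chain, so the last
                # matching extension wins as the default suggestion.
                repo = candidate
    return repo
```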
> we won't be able to identify if the issue exists beforehand. However with the GitHub workflow we should be able to immediately answer the issue with a similar message after it's posted.
nothing is impossible!! If the GitHub workflow catalogues all the issues (in datalad and its extensions) according to fingerprint within datalad-helpme (a file tree, e.g. "issues/fingerprint-checksum.json", maybe with a one-level caching tier using the first two digits of the fingerprint), then helpme could check if such a fingerprint is known already (a quick non-auth query to the GitHub datalad-helpme tree of fingerprints) and get all information about it (which issue(s), fixed or not, etc). oh, this could be awesome! ;-)
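A sketch of what such a non-auth lookup could look like, assuming a hypothetical `issues/<checksum>.json` layout in the datalad-helpme repo (both the path scheme and the JSON fields are assumptions):

```python
import json
import urllib.error
import urllib.request

def lookup_fingerprint(checksum, repo='datalad/datalad-helpme',
                       branch='master'):
    """Fetch catalogued info for a crash fingerprint, or None if unknown."""
    url = ('https://raw.githubusercontent.com/'
           '%s/%s/issues/%s.json' % (repo, branch, checksum))
    try:
        with urllib.request.urlopen(url) as resp:
            # Hypothetical schema, e.g. {"issue": 1234, "fixed_in": "..."}
            return json.load(resp)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None  # fingerprint not catalogued yet
        raise
```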
> Gitter message that suggested the same thing haha.
oh sorry -- whenever the browser dies, all the web-based social media goes with it and revives one at a time ;)
okey doke - the toy example (the issue submission bit) is underway! This script (using the helpme version 0.0.40 that is under pull request) will respond to an exception by opening up a browser window (no GitHub token required) and asking the user questions, and of course including metadata. A hash of some content (provided by the calling function, I chose exception metadata) is provided to give an identifier.
https://github.com/rseng/github-support
Next I'm going to add a workflow to that repo to respond to a new issue by getting the identifier, and either looking it up or saving the metadata. This part will be scoped beyond helpme, and can vary depending on how the implementer wants to roll it. What I'll likely do is just provide a lot of examples that folks can modify.
And I'll invite you guys to rseng! It's something I've had up my sleeve for a bit - I want to grow a small community of RSE developers that generally want to work together on projects. This small project, which will be used for datalad and provided generally as an example, is a perfect fit.
okay toy example is totally done!
https://github.com/rseng/github-support
You can follow the instructions there to install the branch, and then just run `./example.py` for your browser to open and write an issue. If you don't change example.py, it will generate the same error hash, and within about 20 seconds a workflow will comment on the issue that it's already open (with a link) and then close it. Take a look at the README.md there, with a bunch of questions / suggestions for how you would want to implement this for datalad-helpme.
I'm waiting on a conda package approval and then likely I'll merge this particular PR so you can actually install from pip, and I'm going to do a small blog post / write up to share the general tool. And @yarikoptic once you've tried the toy example, let's continue discussion of how you want the workflow to look for datalad-helpme.
@mih @yarikoptic could you point me to where in the datalad codebase you would want to catch some error and start the helpme flow? I saw there are exceptions in `datalad/support/exceptions` that are mostly based on RuntimeError, but I think you had mentioned something in the cmdline module? I'm not familiar with it, so if you could walk me through an example use case (and then how it errors), that should hopefully be enough to get started.
https://github.com/datalad/datalad/blob/master/datalad/cmdline/main.py#L560 is AFAIK the "best" point to introduce that.
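Purely illustrative, here is roughly what hooking in at that point could look like (`_dispatch()` and `submit_helpme_report()` are hypothetical stand-ins, not existing datalad or helpme APIs):

```python
from datalad.api import wtf
from datalad.utils import swallow_outputs

def main(args=None):
    try:
        _dispatch(args)  # stand-in for the existing command dispatch
    except Exception as exc:
        # Amend the report with system info before handing off to helpme.
        with swallow_outputs() as cmo:
            wtf()
        submit_helpme_report(exc, system_info=cmo.out)  # hypothetical
        raise
```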
Okay, we're close! @yarikoptic, the wtf output is too long to open an issue programmatically. Is there a way to shorten it, or just select a subset of attributes that are most valuable? Here is the current body:
## What is the problem?
<!-- Please write a few sentences about the issue-->
## What steps will reproduce the problem?
<!-- What triggered this error? -->
## Is there anything else that would be useful to know in this context?
<!-- Have you had any success using DataLad before? (to assess your expertise/prior luck. We would welcome your testimonial additions to https://github.com/datalad/datalad/wiki/Testimonials as well)-->
<details><summary>DataLad 0.10.3.1.dev3382 WTF (configuration, datalad, dataset, dependencies, environment, extensions, git-annex, location, metadata_extractors, python, system)</summary>
# WTF
## configuration <SENSITIVE, report disabled by configuration>
## datalad
- full_version: 0.10.3.1.dev3382-ge7e4-dirty
- version: 0.10.3.1.dev3382
## dataset
- id: None
- metadata: <SENSITIVE, report disabled by configuration>
- path: /home/vanessa/Desktop/Code/datalad
- repo: GitRepo
## dependencies
- appdirs: 1.4.3
- boto: 2.49.0
- cmd:annex: 7.20190819+git2-g908476a9b-1~ndall+1
- cmd:bundled-git: 2.20.1
- cmd:git: 2.20.1
- cmd:system-git: 2.23.0
- cmd:system-ssh: 7.6p1
- git: 3.1.1
- gitdb: 4.0.2
- humanize: 2.4.0
- iso8601: 0.1.12
- keyring: 21.2.1
- keyrings.alt: 3.4.0
- msgpack: 1.0.0
- requests: 2.23.0
- tqdm: 4.46.0
- wrapt: 1.12.1
## environment
- LANG: en_US.UTF-8
- PATH: /home/vanessa/anaconda3/bin:/home/vanessa/anaconda3/condabin:/home/vanessa/.rbenv/plugins/ruby-build/bin:/home/vanessa/.rbenv/shims:/home/vanessa/.rbenv/bin:/home/vanessa/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin
## extensions
## git-annex
- build flags:
- Assistant
- Webapp
- Pairing
- S3
- WebDAV
- Inotify
- DBus
- DesktopNotify
- TorrentParser
- MagicMime
- Feeds
- Testsuite
- dependency versions:
- aws-0.20
- bloomfilter-2.0.1.0
- cryptonite-0.25
- DAV-1.3.3
- feed-1.0.0.0
- ghc-8.4.4
- http-client-0.5.13.1
- persistent-sqlite-2.8.2
- torrent-10000.1.1
- uuid-1.3.13
- yesod-1.6.0
- key/value backends:
- SHA256E
- SHA256
- SHA512E
- SHA512
- SHA224E
- SHA224
- SHA384E
- SHA384
- SHA3_256E
- SHA3_256
- SHA3_512E
- SHA3_512
- SHA3_224E
- SHA3_224
- SHA3_384E
- SHA3_384
- SKEIN256E
- SKEIN256
- SKEIN512E
- SKEIN512
- BLAKE2B256E
- BLAKE2B256
- BLAKE2B512E
- BLAKE2B512
- BLAKE2B160E
- BLAKE2B160
- BLAKE2B224E
- BLAKE2B224
- BLAKE2B384E
- BLAKE2B384
- BLAKE2BP512E
- BLAKE2BP512
- BLAKE2S256E
- BLAKE2S256
- BLAKE2S160E
- BLAKE2S160
- BLAKE2S224E
- BLAKE2S224
- BLAKE2SP256E
- BLAKE2SP256
- BLAKE2SP224E
- BLAKE2SP224
- SHA1E
- SHA1
- MD5E
- MD5
- WORM
- URL
- operating system: linux x86_64
- remote types:
- git
- gcrypt
- p2p
- S3
- bup
- directory
- rsync
- web
- bittorrent
- webdav
- adb
- tahoe
- glacier
- ddar
- git-lfs
- hook
- external
- supported repository versions:
- 5
- 7
- upgrade supported from repository versions:
- 0
- 1
- 2
- 3
- 4
- 5
- 6
- version: 7.20190819+git2-g908476a9b-1~ndall+1
## location
- path: /home/vanessa/Desktop/Code/datalad
- type: dataset
## metadata_extractors
- annex:
- load_error: None
- module: datalad.metadata.extractors.annex
- version: None
- audio:
- load_error: No module named 'mutagen' [audio.py:<module>:17]
- module: datalad.metadata.extractors.audio
- datacite:
- load_error: None
- module: datalad.metadata.extractors.datacite
- version: None
- datalad_core:
- load_error: None
- module: datalad.metadata.extractors.datalad_core
- version: None
- datalad_rfc822:
- load_error: None
- module: datalad.metadata.extractors.datalad_rfc822
- version: None
- exif:
- load_error: No module named 'exifread' [exif.py:<module>:16]
- module: datalad.metadata.extractors.exif
- frictionless_datapackage:
- load_error: None
- module: datalad.metadata.extractors.frictionless_datapackage
- version: None
- image:
- load_error: None
- module: datalad.metadata.extractors.image
- version: None
- xmp:
- load_error: No module named 'libxmp' [xmp.py:<module>:20]
- module: datalad.metadata.extractors.xmp
## python
- implementation: CPython
- version: 3.7.4
## system
- distribution: debian/buster/sid
- encoding:
- default: utf-8
- filesystem: utf-8
- locale.prefered: UTF-8
- max_path_length: 306
- name: Linux
- release: 5.3.0-51-generic
- type: posix
- version: #44~18.04.2-Ubuntu SMP Thu Apr 23 14:27:18 UTC 2020
</details>
I removed the section on key/value backends and it seemed to work:
https://github.com/datalad/datalad-helpme/issues/3
Can we tell wtf to not include that?
Ahh, looks like I can say which sections to include! http://docs.datalad.org/en/latest/generated/man/datalad-wtf.html#s-section-section-section
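E.g., assuming the Python API mirrors the CLI's `--section` option (the `sections` parameter name is an assumption):

```python
from datalad.api import wtf
from datalad.utils import swallow_outputs

# Request only the sections most useful for a bug report, keeping the
# rendered body short enough for a prefilled-URL submission.
with swallow_outputs() as cmo:
    wtf(sections=['datalad', 'dependencies', 'python', 'system'])
report = cmo.out
```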
Both @yarikoptic and I had the impression that we do not get enough feedback on cases where datalad doesn't work as intended and crashes. One possibility to mitigate this situation would be to add an automated feedback submission ability to the cmdline interface's crash handler. It could format a document with the relevant details (e.g. the traceback and `wtf()` system info).
Given that such a report could contain sensitive information, we should allow users to scan/edit such a report before it is sent (unless configured to be OK).
Maybe there is already some tool that would facilitate the server-side of things.