HW-SWeL / BMUSE

Bioschemas Mark Up Scraper and Extractor
https://app.swaggerhub.com/apis-docs/swel/BMUSE/
Apache License 2.0

Stats page #13

Closed kcmcleod closed 5 years ago

kcmcleod commented 5 years ago

How many pages were scraped? How many DataCatalogs? Etc.

AlasdairGray commented 5 years ago

Ideally this page would replace the live deploys page.

kcmcleod commented 5 years ago

Current draft: https://lxbisel.macs.hw.ac.uk:8080/EE-WebApp/stats

Not pretty. Needs sorting. Perhaps a pie chart? Possibly a list of top-level URLs?

AlasdairGray commented 5 years ago

We probably want some high-level stats, e.g. number of resources (databases), number of pages, and number of pages by type. I'm not sure that the number of triples is that meaningful.
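A minimal sketch of these high-level counts, assuming the scraped data can be reduced to (resource, page URL, type IRI) records. The record shape, the function name `high_level_stats` and the sample values are illustrative assumptions, not BMUSE's actual data model.

```python
from collections import Counter

# Hypothetical records: (resource/site, page URL, type IRI found on that page).
records = [
    ("example.org", "https://example.org/d/1", "https://schema.org/Dataset"),
    ("example.org", "https://example.org/d/2", "https://schema.org/Dataset"),
    ("example.org", "https://example.org/", "https://schema.org/DataCatalog"),
    ("example.com", "https://example.com/p/1", "https://schema.org/Protein"),
]


def high_level_stats(records):
    """Return (number of resources, number of pages, pages per type)."""
    resources = {site for site, _, _ in records}
    pages = {page for _, page, _ in records}
    # De-duplicate (page, type) pairs so a repeated record is counted once.
    page_type_pairs = {(page, type_iri) for _, page, type_iri in records}
    pages_by_type = Counter(type_iri for _, type_iri in page_type_pairs)
    return len(resources), len(pages), pages_by_type


n_resources, n_pages, by_type = high_level_stats(records)
```

Triples are deliberately not counted here, in line with the doubt above about how meaningful that number is.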

For the by-type breakdown, we may want to restrict the list to Bioschemas types of interest, e.g. omit Blog.

There is probably no need to show the full URLs of types; just their hyperlinked names would suffice. Can we order the types by name, or even let the user sort dynamically by name or size?
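One way the filtering, short names and sorting could look, as a sketch only. The allow-list `TYPES_OF_INTEREST`, the helper names and the sample counts are assumptions made for illustration; the real list of Bioschemas types would come from the project.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of type names worth showing on the stats page.
TYPES_OF_INTEREST = {"DataCatalog", "Dataset", "Protein"}


def short_name(type_iri):
    """Return the local name of a type IRI, e.g. 'Dataset' for https://schema.org/Dataset."""
    parsed = urlparse(type_iri)
    # Prefer a fragment (#Name) if present, otherwise the last path segment.
    return parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]


def type_rows(counts, sort_by="name"):
    """Build (name, link target, count) rows, filtered and sorted for display."""
    rows = [
        (short_name(iri), iri, n)
        for iri, n in counts.items()
        if short_name(iri) in TYPES_OF_INTEREST
    ]
    if sort_by == "size":
        # Largest first; ties broken by name.
        return sorted(rows, key=lambda row: (-row[2], row[0].lower()))
    return sorted(rows, key=lambda row: row[0].lower())


# Made-up counts, including a type (Blog) that should be dropped.
sample_counts = {
    "https://schema.org/Dataset": 120,
    "https://schema.org/DataCatalog": 8,
    "https://schema.org/Blog": 40,
    "https://schema.org/Protein": 300,
}
rows_by_name = type_rows(sample_counts, sort_by="name")
rows_by_size = type_rows(sample_counts, sort_by="size")
```

Each row keeps the full IRI as the link target, so the page can show only the short name as the link text.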

We possibly want some way of digging into which resources have been marked up with which types, with links to the structured data testing tool for an example page, similar to the live deploys page. We want to make it easy for people to get to the markup of a resource so they can copy it and hack it to their needs.

We could also have some way of getting a list of all URLs that have been indexed.


AlasdairGray commented 5 years ago

Stats page is returning an error

[Screenshot: 2019-08-30 at 14:02:21]

kcmcleod commented 5 years ago

2019/09/03 09:16:08 [error] 3384#0: *2073 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 137.195.27.39, server: lxbisel.macs.hw.ac.uk, request: "GET /EE-WebApp/stats HTTP/1.1", upstream: "http://127.0.0.1:8081/EE-WebApp/stats", host: "lxbisel.macs.hw.ac.uk:8080", referrer: "https://github.com/HW-SWeL/Scraper/issues/13"

kcmcleod commented 5 years ago

> Stats page is returning an error

Need to precalculate the answers in a summary graph... todo!
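A rough sketch of the precalculation idea: compute the aggregates once (for example after each scrape) and let the stats page read the stored result instead of aggregating on every request. A JSON file stands in here for the summary graph, purely as an assumption to keep the example self-contained; the function names and sample data are hypothetical too.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def write_summary(page_types, path):
    """Compute the expensive aggregates once and store them."""
    summary = {
        "pages": len({page for page, _ in page_types}),
        "pages_by_type": dict(Counter(type_name for _, type_name in page_types)),
    }
    Path(path).write_text(json.dumps(summary, sort_keys=True))
    return summary


def read_summary(path):
    """What the stats page would do per request: a cheap read, no aggregation."""
    return json.loads(Path(path).read_text())


# Made-up (page URL, type name) pairs standing in for the scraped data.
page_types = [
    ("https://example.org/d/1", "Dataset"),
    ("https://example.org/d/2", "Dataset"),
    ("https://example.org/", "DataCatalog"),
]
summary_path = Path(tempfile.mkdtemp()) / "summary.json"
written = write_summary(page_types, summary_path)
loaded = read_summary(summary_path)
```

With this split, the request handler does constant work regardless of how much has been scraped, which is what the upstream timeout suggests is needed.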

> For the by type we may want to focus the list to Bioschemas types of interest, i.e. omit blog.

Done.

> Probably no need to show the full URLs of types, just their hyperlinked names would suffice

Done.

> Can we order the types by their name or even let the user dynamically sort by name or size.

Done: https://github.com/HW-SWeL/BSKgE/commit/0b19b2f73687c221e271325344b6963e5aa498c6

> We possibly want some way of digging into which resources have been marked up with which types, with links to the structured data testing tool for an example page, similar to the live deploys page. We want to make it easy for people to get to the markup of a resource so they can copy it and hack it to their needs.

Search by type? Or do you mean something more complex? Making it easy to copy markup may have issues. Firstly, you propagate junk. Secondly, copyright.

> We could also have some way of getting a list of all URLs that have been indexed.

If you mean sites, OK. If you actually mean URLs, I imagine that would be far too slow...
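To illustrate the sites-versus-URLs distinction, here is a small sketch that collapses page URLs to their distinct sites. The function name `sites` and the sample URLs are assumptions; in practice this would be computed ahead of time rather than per request.

```python
from urllib.parse import urlsplit


def sites(urls):
    """Collapse page URLs to a sorted list of distinct scheme://host sites."""
    found = set()
    for url in urls:
        parts = urlsplit(url)
        # Skip anything that is not an absolute URL.
        if parts.scheme and parts.netloc:
            found.add(f"{parts.scheme}://{parts.netloc.lower()}")
    return sorted(found)


# Made-up indexed URLs; two hosts, one written with mixed case.
indexed = [
    "https://example.org/d/1",
    "https://example.org/d/2",
    "https://Example.org/about",
    "http://example.com/p/1",
]
site_list = sites(indexed)
```

The site list grows with the number of resources rather than the number of pages, so it stays short enough to show on a stats page.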

kcmcleod commented 5 years ago

> We possibly want some way of digging into which resources have been marked up with which types, with links to the structured data testing tool for an example page, similar to the live deploys page. We want to make it easy for people to get to the markup of a resource so they can copy it and hack it to their needs.

Moving this into a new issue, as the rest is done.