Sotera / DatawakeDepot

Loopback web application for administration of Datawake networks
Apache License 2.0

Memex Domain Export. #37

Closed bwhiteman closed 8 years ago

bwhiteman commented 8 years ago

We need to be able to export a domain for use by the crawling teams. The export should be the aggregate of all of the trails within the domain. (At some point we may want to be able to choose specific trails.)

The format should be similar to

{
  "urls": ["http://la.backpage.com/1234", "http://la.backpage.com/1234", "http://www.wikipedia.com"], // all visited URLs
  "topLevelDomains": ["http://la.backpage.com"], // common top-level domains
  "searchTerms": ["escorts", "los angeles", "massage"], // all search terms
  "domainEntities": ["cherry", "mimi", "pasadena"], // entities added by a user
  "domainEntityTypes": ["person", "place", "bitcoin address"], // entity types added by a user
  "commonEntities": ["massage", "parlor", "los angeles", "pasadena"] // top 20 most extracted entities
}
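
As a rough illustration of the "aggregate of all of the trails" idea, here is a minimal Python sketch. The trail shape (a dict with `urls` and `searchTerms` lists) and the function name `aggregate_trails` are assumptions made for this example, not the actual Datawake data model. The sketch also drops repeat entries, which is a choice the issue does not settle.

```python
def aggregate_trails(trails):
    """Merge per-trail lists into one domain-level export.

    Assumption: each trail is a dict that may carry "urls" and
    "searchTerms" lists. Duplicates are dropped, first-seen order kept.
    """
    export = {"urls": [], "searchTerms": []}
    for trail in trails:
        for key in export:
            for item in trail.get(key, []):
                if item not in export[key]:
                    export[key].append(item)
    return export


# Hypothetical trails, for illustration only.
trails = [
    {"urls": ["http://example.org/a"], "searchTerms": ["first term"]},
    {"urls": ["http://example.org/a", "http://example.com/"], "searchTerms": ["second term"]},
]
domain_export = aggregate_trails(trails)
```

Restricting the export to specific trails later would only mean filtering the `trails` list before calling the function.
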

michaelsframe commented 8 years ago

We should include the domain Id.

I'd rename "topLevelDomains" to "rootUrls". We'll need to do a search for unique ones which should be easy enough.
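
A minimal sketch of that unique-root search, in Python for illustration. The function name `root_urls` is hypothetical, and treating "root" as scheme plus host is an assumption about what the export should contain.

```python
from urllib.parse import urlsplit


def root_urls(urls):
    """Return the unique scheme://host roots of the URLs, in first-seen order."""
    seen = {}
    for url in urls:
        parts = urlsplit(url)
        # Skip anything without both a scheme and a host (assumed unwanted).
        if parts.scheme and parts.netloc:
            seen.setdefault(f"{parts.scheme}://{parts.netloc}", None)
    return list(seen)


# Hypothetical visited URLs, for illustration only.
visited = [
    "http://example.org/1234",
    "http://example.org/5678",
    "http://www.example.com",
]
roots = root_urls(visited)
```
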

I understand SearchTerms, Domain Entity Types, and Domain Entities. What are commonEntities?

We'll need to add the URL relevance to the urls array. I do think we should include ids on each appropriate item (overall domain, urls, domain entities) in case we want to retrieve results from a third party who uses this export.

bwhiteman commented 8 years ago

rootUrls is fine. commonEntities would be the most frequently extracted entities. I'm not sure how useful this will be, but we can see.

I agree everything should probably be an array of objects like {"id": "1223", "value": "value"}.
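
To make the id/value shape concrete, here is a small Python sketch. The input record layout and the helper name `to_id_value` are assumptions for this example. Only the output shape comes from the comment above.

```python
def to_id_value(records, value_key):
    """Map stored records to the {"id": ..., "value": ...} export shape.

    Assumption: every record is a dict with an "id" field plus the field
    named by value_key. Ids are exported as strings, as in the example.
    """
    return [{"id": str(r["id"]), "value": r[value_key]} for r in records]


# Hypothetical domain entity records, for illustration only.
entities = [
    {"id": 1223, "name": "pasadena", "type": "place"},
    {"id": 1224, "name": "mimi", "type": "person"},
]
exported = to_id_value(entities, "name")
```
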

michaelsframe commented 8 years ago

Since we don't currently have commonEntities, can we just include ExtractedEntities (with id, occurrences, URL extracted from, etc.)?

bwhiteman commented 8 years ago

That's what I meant: just pick an arbitrary number, like the top 20 extracted entities with the highest counts.
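
One way to pick those top entities, sketched in Python. The extraction record shape (a dict with a `value` field, one record per occurrence) and the name `top_extracted_entities` are assumptions for this example. The limit of 20 is the arbitrary number suggested above.

```python
from collections import Counter


def top_extracted_entities(extractions, limit=20):
    """Return the most frequently extracted entity values with their counts.

    Assumption: each extraction is a dict with a "value" field, and the
    same value appears once per occurrence.
    """
    counts = Counter(e["value"] for e in extractions)
    return [
        {"value": value, "occurrences": n}
        for value, n in counts.most_common(limit)
    ]


# Hypothetical extraction records, for illustration only.
extractions = (
    [{"value": "alpha"}] * 3 + [{"value": "beta"}] * 2 + [{"value": "gamma"}]
)
top_two = top_extracted_entities(extractions, limit=2)
```
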

bmcdougald commented 8 years ago

Looks like it's working as described. JSON exports and has all the sections with appropriate content. Couldn't get domain items manually added because that panel stuff was in a different branch, but the section was there in the JSON.

bwhiteman commented 8 years ago

test-domain.json.txt

@bmcdougald @michaelsframe When I try this on a larger domain, I only get a small amount of the domain, with no logging.

bwhiteman commented 8 years ago

Fixed in 100/90