emory-libraries / dlp-lux

Discovery for the DLP Cor Repository

Trigger a Google crawl for Lux #503

Closed nikdragovic closed 4 years ago

nikdragovic commented 4 years ago

Once HTTP auth is removed, trigger a crawl via Google Search Console or Emory-preferred workflow.

What team is responsible for this process, and can be consulted in the future? Is it possible to submit a Blacklight sitemap, and is anyone aware of the status on community developments?

eporter23 commented 4 years ago

Looks like we need to add a verification code for Google Webmaster Tools/Google Search Console to our DNS configuration. I have generated a code, which can be referenced in Box.
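For reference, Google's DNS verification works by publishing a TXT record of the form `google-site-verification=<token>` on the domain. Below is a minimal sketch of checking a set of TXT records for that token. The record values and `EXAMPLE_TOKEN` are placeholders, not the real code from Box, and `has_google_verification` is a hypothetical helper name; fetching the records from DNS is left out.

```python
def has_google_verification(txt_records, expected_token):
    """Return True if any TXT record carries the expected verification token."""
    prefix = "google-site-verification="
    for record in txt_records:
        # DNS tools often print TXT values wrapped in double quotes.
        value = record.strip().strip('"')
        if value.startswith(prefix) and value[len(prefix):] == expected_token:
            return True
    return False


# Placeholder records, shaped like the output of a DNS lookup tool.
records = ['"v=spf1 -all"', '"google-site-verification=EXAMPLE_TOKEN"']
print(has_google_verification(records, "EXAMPLE_TOKEN"))
```

A check like this could be run against the DNS team's change before asking Search Console to verify.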

libdgg commented 4 years ago

I'm checking on the logistics of this. -Doug

libdgg commented 4 years ago

I am verifying through the LTDS Google account, where we have all the other web properties, which means the site verification code is different from the one in Box. I have a ticket in for verification via DNS: request INC03371987.

libdgg commented 4 years ago

@eporter23 I have an update on this that I need to run by you and the team. I'll add it to the dlp-launch ticket for our change management meeting discussion. -DG

libdgg commented 4 years ago

I closed/canceled the ticket INC03371987 since I need to get with Emily as a next step to determine if we want to proceed with DNS verification of "library.emory.edu" + URL verification for "digital.library.emory.edu" or approach this in another way. I'll resubmit a new ticket once we determine if we are using DNS verification or not.

libdgg commented 4 years ago

FYI for further discussion about Google Search Console + Google Analytics work

I see that our Google Analytics are set up for digital.library.emory.edu here: https://analytics.google.com/analytics/web/?authuser=1#/report-home/a164499118w39366856p39046093

The question right now is how best to verify the property digital.library.emory.edu for Google Search Console. Unless there are other insights or objections, I'm inclined to go with domain verification for "library.emory.edu" per instructions from the Emory DNS folks, and then do URL verification for "https://digital.library.emory.edu" per what I've read on the web regarding subdomains and domain verification.

mark-dce commented 4 years ago

@libdgg from what I can see, if you have the top-level domain confirmed, you can ask google to crawl any subdomain without needing to re-validate the domain. I just tried with our console and here's the steps I'm seeing:

OPEN THE GOOGLE SEARCH CONSOLE
Here's the main console for our top-level domain - curationexperts.com

image

ENTER THE URL FOR THE SITE WITHIN YOUR DOMAIN YOU WANT TO INDEX
tenejo is our demo Hyrax repository; it's reached as a subdomain of our primary domain (like digital is a subdomain of library.emory.edu).
NOTE: you will need to wait to do this until HTTP Authentication is turned off

image

REQUEST REINDEXING FROM THE URL INSPECTION PAGE

image image

CHECK BACK ON PROGRESS PERIODICALLY
Asking Google to index the site without a sitemap will take some time, so you'll want to check in periodically to make sure no issues are encountered.

image image

mark-dce commented 4 years ago

You can ensure a more thorough indexing by periodically submitting a full sitemap - there are some breadcrumbs here for automating sitemap generation: https://github.com/projectblacklight/blacklight/wiki/Search-engine-harvesting

The Blacklight list on Google Groups or the Code4Lib Slack organization might be able to point you to more sample code for automating the submission of the sitemap to Google.
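To give a feel for what a generated sitemap contains, here is a rough sketch in Python using only the standard library. It is not the approach from the Blacklight wiki page, just an illustration of the sitemaps.org format. The record URLs, the `/catalog/...` paths, and the `build_sitemap` name are assumptions made up for the example; a real job would pull document IDs and modification dates from the index.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            # lastmod is optional in the protocol; use a W3C date such as YYYY-MM-DD.
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record URLs, for illustration only.
entries = [
    ("https://digital.library.emory.edu/catalog/example-id-1", "2020-01-15"),
    ("https://digital.library.emory.edu/catalog/example-id-2", None),
]
sitemap_xml = build_sitemap(entries)
print(sitemap_xml)
```

The resulting file would be served from the site and its URL submitted under Sitemaps in Search Console.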

mark-dce commented 4 years ago

@libdgg To see more detail about Google's indexing, you'll want to add the site as a persistent web 'property':

SEE WHICH SITES ARE CONFIGURED

image

ADD A NEW PROPERTY (SITE) WITHIN YOUR DOMAIN

image

ADD SUBDOMAINS USING THE URL PREFIX

image

For example, if we wanted to start indexing our dev site...

image

Because you (Emory's Google Search account) already own the parent domain, you should be good to go

image
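The parent/subdomain relationship described here can be sketched as a simple hostname check. This only illustrates the rule of thumb (a host falls under a verified domain if it equals the domain or ends with "." plus the domain); it is not Google's implementation, and `covered_by_domain` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def covered_by_domain(verified_domain, url):
    """Return True if the URL's host is the verified domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    domain = verified_domain.lower().rstrip(".")
    # Require a dot boundary so "notlibrary.emory.edu" does not match "library.emory.edu".
    return host == domain or host.endswith("." + domain)


print(covered_by_domain("library.emory.edu", "https://digital.library.emory.edu/"))
```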

CHECK OUT INDEXING COVERAGE FOR YOUR SITE
Notice that we recently cleaned out a lot of sample items that folks had created over time during conferences & workshops - hence the high excluded count

image

libdgg commented 4 years ago

Thanks Mark. I have submitted ticket INC03376253 to have the Emory DNS team verify the primary domain "library.emory.edu". Once that is done I plan to add the "digital.library.emory.edu" property per the info above from Mark.

libdgg commented 4 years ago

DNS verification complete for "digital.library.emory.edu" and "library.emory.edu". The Google Search Console shows both properties so it looks like everything worked.

tmiles2 commented 4 years ago

@libdgg Can we close this ticket?