FinalsClub / karmaworld

KarmaNotes.org v3.0
GNU Affero General Public License v3.0

Can web crawlers read our notes? #309

Closed charlesconnell closed 10 years ago

charlesconnell commented 10 years ago

Obviously, we want them to be able to. I'm not sure how well they can read what's inside the note iframe, especially since it's loaded with an ajax request after the page loads. @AndrewMagliozzi There is a feature on Google webmaster tools that allows you to see what the Googlebot will see when it crawls your page. I can't do it since it requires a verified login (proving that I control the site), so I wanted to let you set that up, if you haven't already.

btbonval commented 10 years ago

There are more search engines in the world than Google. I read somewhere that Google might index IFRAMEs, while other search engines only index the ANCHOR HREFs.

If the download button is a normal ANCHOR HREF and not some JavaScript hackery impersonating an ANCHOR HREF, we should be fine. Checking with Google does not ensure our junk works with all search engines, but it is a good thing to check because Google is the most well known. -Bryan

AndrewMagliozzi commented 10 years ago

Let's discuss together tomorrow.

charlesconnell commented 10 years ago

I've taken the rel=canonical links out of prod. New idea: stick the plain text of the note in a noscript tag underneath the iframe. This is probably(?) SEO friendly. It might still seem deceptive and therefore rank us low, but I think we should try this and see how it goes for now.
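
A minimal sketch of the noscript idea, assuming the note's plain text is already available as a string. The function name `noscript_fallback` and the paragraph splitting are hypothetical, not code from karmaworld:

```python
from html import escape


def noscript_fallback(note_text: str) -> str:
    """Return a <noscript> block holding the note's plain text, HTML-escaped.

    Hypothetical helper: blank lines are treated as paragraph breaks.
    """
    paragraphs = [p.strip() for p in note_text.split("\n\n") if p.strip()]
    body = "\n".join(f"  <p>{escape(p)}</p>" for p in paragraphs)
    return f"<noscript>\n{body}\n</noscript>"


# Illustrative input only; real note text would come from the stored note.
snippet = noscript_fallback("Lecture 1: limits\n\nf(x) < g(x) near 0")
print(snippet)
```

The output would be placed directly under the iframe in the note page template. Escaping matters here because note text can contain characters such as `<` that would otherwise be read as markup.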

AndrewMagliozzi commented 10 years ago

I know we dismissed the idea in our meeting, but I would like to explore the option of sanitizing and displaying this HTML without an iframe.

The problem is that the HTML comes back from gdrive with head tags and all. Pure HTML in a div or span would be optimal for SEO and more. So let's at least see what we can do to make it happen.
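
One way this could look, as a rough sketch only: pull out what sits inside `<body>`, drop `<script>` and `<style>`, and keep no attributes except `href`. The names `BodyExtractor` and `body_fragment` are hypothetical, and this is not a security-grade sanitizer (for example, it does not vet `href` values), so a vetted sanitizing library would still be needed before serving user content inline:

```python
from html import escape
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collect markup found inside <body>, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.in_body and self.skip_depth == 0:
            # Keep only href; event handlers and inline styles are dropped.
            kept = "".join(
                f' {name}="{escape(value or "")}"'
                for name, value in attrs
                if name == "href"
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag to emit.
        if tag not in self.SKIPPED and tag != "body":
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.in_body and self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.parts.append(escape(data))


def body_fragment(document: str) -> str:
    """Return the cleaned inner markup of <body> as one string."""
    parser = BodyExtractor()
    parser.feed(document)
    parser.close()
    return "".join(parser.parts)


# Made-up document standing in for what the export returns.
doc = (
    "<html><head><title>Notes</title><style>p{color:red}</style></head>"
    '<body><p onclick="x()">Week 1 <a href="/course/1">syllabus</a></p>'
    "<script>alert(1)</script></body></html>"
)
fragment = body_fragment(doc)
print(fragment)
```

The resulting fragment has no head, scripts, or styles, so it could be dropped into a div on the note page.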

btbonval commented 10 years ago

In theory, setting edits aside, the page that wraps around the note should reference things like the course the note came from and possibly the school. The less we have on the note details page, the better.

But if we change the top bar, the side panels, or rename the course, the note page will display a sudden break in flow.

That assumes the large HTML notes remain statically hosted (which I really think they need to be). The alternative is for the server to fetch the notes off S3, wrap the HTML around the HTML (yup, I said that right), and then push it to the client. This will serialize the currently parallel process of contacting both our server and S3, which will add delay on top of whatever delay there is from the processing which must be done server side. Also, those large PDF to HTML notes might be pretty big hits to server memory.
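
To make the serialization concern concrete, here is a small sketch of the server-side alternative. Everything in it is hypothetical: `render_note_page`, the template, and the `fetch_note` callable (which stands in for a real S3 read) are illustrations, not karmaworld code:

```python
from html import escape
from typing import Callable

# Hypothetical page chrome; the real site template is much larger.
PAGE_TEMPLATE = (
    "<!DOCTYPE html>\n"
    "<html><head><title>{title}</title></head>\n"
    "<body><nav>{course}</nav>\n"
    '<div id="note">{note_html}</div>\n'
    "</body></html>"
)


def render_note_page(
    note_key: str,
    fetch_note: Callable[[str], str],
    title: str,
    course: str,
) -> str:
    # Step 1 (blocking): the server must finish this fetch before it can
    # send anything, which is the serialization described above.
    note_html = fetch_note(note_key)
    # Step 2: wrap the site HTML around the note HTML. The note markup is
    # assumed to be an already-sanitized fragment, so it is not escaped.
    return PAGE_TEMPLATE.format(
        title=escape(title), course=escape(course), note_html=note_html
    )


# A dict stands in for the bucket so the sketch runs without network access.
fake_bucket = {"notes/42.html": "<p>Week 1: limits</p>"}
page = render_note_page(
    "notes/42.html", fake_bucket.__getitem__, "Calculus notes", "Math 101"
)
print(page)
```

Note that the whole note is held in server memory while the page is assembled, which is where the memory concern for large PDF-to-HTML notes comes from.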

AndrewMagliozzi commented 10 years ago

Hey all, let's meditate on this and discuss during our group meeting tomorrow.

charlesconnell commented 10 years ago

Add rel="canonical" to note contents.

charlesconnell commented 10 years ago

Done. Run manage.py add_canonical_link to update old notes. New notes will have the link added to them automatically.
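
For readers unfamiliar with the technique, this is roughly what adding a canonical link to a stored note can look like. It is a sketch under assumptions, not the actual command's implementation: `insert_canonical_link` and the example URL are hypothetical.

```python
import re
from html import escape

# Matches an existing <link ... rel="canonical" ...> tag.
CANONICAL_RE = re.compile(
    r'<link\b[^>]*\brel=["\']canonical["\']', re.IGNORECASE
)
HEAD_CLOSE_RE = re.compile(r"</head\s*>", re.IGNORECASE)


def insert_canonical_link(note_html: str, canonical_url: str) -> str:
    """Insert <link rel="canonical"> just before </head> if it is missing."""
    if CANONICAL_RE.search(note_html):
        # Already tagged; returning unchanged keeps repeat runs harmless.
        return note_html
    match = HEAD_CLOSE_RE.search(note_html)
    if match is None:
        # No </head> to anchor on; leave the note alone rather than guess.
        return note_html
    link = f'<link rel="canonical" href="{escape(canonical_url)}">'
    return note_html[: match.start()] + link + note_html[match.start():]


# Made-up note and URL, for illustration only.
original = "<html><head><title>t</title></head><body></body></html>"
once = insert_canonical_link(original, "https://example.com/note/1")
twice = insert_canonical_link(once, "https://example.com/note/1")
print(once)
```

Making the step idempotent is a deliberate choice in this sketch: a batch job over old notes can then be re-run after a partial failure without adding duplicate tags.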

btbonval commented 10 years ago

The code looks good from a quick review. I noticed in one commit you also switched to using local static hosting for dev, so I have to wonder if you tested this code against a running static S3 instance?

charlesconnell commented 10 years ago

This was tested against S3. Even with that change, notes are still stored/retrieved from S3. Just not other static assets.

btbonval commented 10 years ago

Excellent. Did you run this against beta's bucket or the dev bucket (or both)?

We should schedule a time to post-process prod's bucket with @AndrewMagliozzi.

charlesconnell commented 10 years ago

Just the dev bucket.

btbonval commented 10 years ago

Cool. Running it on beta will be a good test case for how it will perform on prod. I can't imagine why it would be different, but systems are complex and unpredictable.

I'll make sure to run this on beta at some point before running it on prod.

@AndrewMagliozzi we might want to set up Glacier now that we have metric tonnage of OCW notes.

ghost commented 10 years ago

Hey everyone,

I just pulled the latest update and replaced the files in the "secret" folder with the official ones from the flash drive. Running first_deploy on the virtual machine, however, is giving me this error:

[image: Inline image 1]

Anyone know what this could be about? Not sure how exactly to troubleshoot syncdb.

Regards,

William

charlesconnell commented 10 years ago

@Mo1ok Hey William. Please resend that message as a plain email to finalsclub-dev@finalsclub.org. You just commented on a GitHub ticket, so the image didn't come through.