eksopl / fuuka

Fuuka Imageboard Archiver
http://code.google.com/p/fuuka/

Fuuka API for post retrieving #27

Open eksopl opened 12 years ago

eksopl commented 12 years ago

JSON just kicked in, yo. Etc.

nstepien commented 12 years ago

If you could make an API so that it would work on the archives, and also from 4chan to work with linkified dead quotes. Archivers would need to allow CORS from boards.4chan.org of course.

eksopl commented 12 years ago

Sure. What kind of output do you want, JSON?

Something like:

/api/<board>/post/<postnum>/
/api/<board>/thread/<postnum>/
/api/<board>/thread/<postnum>/deleted/
/api/<board>/thread/<postnum>/deleted/ghost/
/api/<board>/thread/<postnum>/all/

?
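
A minimal sketch of how a client could build these URLs, assuming the route layout proposed above (which was still under discussion at this point). The function name `api_path` and its parameters are hypothetical, not part of Fuuka.

```python
from typing import Optional

# Thread views taken from the proposal above; nothing here is a finalized API.
THREAD_VIEWS = ("deleted", "deleted/ghost", "all")


def api_path(board: str, kind: str, postnum: int, view: Optional[str] = None) -> str:
    """Build a path such as /api/jp/thread/123/deleted/ (hypothetical helper)."""
    if kind not in ("post", "thread"):
        raise ValueError(f"unknown kind: {kind!r}")
    path = f"/api/{board}/{kind}/{postnum}/"
    if view is not None:
        # Only thread routes carry a trailing view segment in the proposal.
        if kind != "thread" or view not in THREAD_VIEWS:
            raise ValueError(f"unsupported view: {view!r}")
        path += f"{view}/"
    return path
```

For example, `api_path("jp", "thread", 123, "deleted/ghost")` yields `/api/jp/thread/123/deleted/ghost/`.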

nstepien commented 12 years ago

JSON?

That would be lighter network-wise, yeah.

/api/board/post/<postnum>/
/api/board/thread/<postnum>/
/api/board/thread/<postnum>/deleted/
/api/board/thread/<postnum>/deleted/ghost/
/api/board/thread/<postnum>/all/

post is enough, since I cannot know from which thread a post number comes from.

I personally don't see the point in getting all the deleted posts from a thread. See: accelspam, and deleted posts that were never quoted because they weren't interesting; deleted posts are usually deleted for good reasons.

Also pinging @woxxy and @oohnoitz so that we can all agree on something universal.

eksopl commented 12 years ago

Getting deleted posts from a thread can be hilarious as hell, when you find someone who posted something on accident and then deleted it. But the anti-ghostbump post deletion delay on 4chan made that less likely to happen. There's also that thing when a mod goes on a rampage, too. Far too often I've had the live thread and the archive opened side by side to check for deleted posts.

Of course, I'm used to boards like /jp/ where spam is carried out by posting new threads and not usually by hijacking existing threads, so other boards might be different.

It would be ultra-taxing on the archivers if enabled on something like 4chan X by default, though. I'm just thinking in terms of a generic API, I don't think that feature should be in 4chan X. Quote hovering for dead posts should probably be fine, but I'd like to hear @woxxy, @oohnoitz and @GXTX.

If I implement an API, I'll probably still support all that kind of functionality, for the sake of being used in smaller scripts with a lower profile than 4chan X, perhaps with config options for the server admin to disable certain requests. I was pretty fond of this one, for example.

nstepien commented 12 years ago

Uses data from Fuuka archiver to display tripfriend post-counts

Haha, oh woaw.

woxxy commented 12 years ago

We already have an api that goes like

/api/chan/thread/board//num/ or /api/chan/thread/?board=&num=

And other functions I don't remember right now.

I need to add a per-post request in this case. If it's just for on-hover of dead posts, it's not an issue in terms of server load. Notice that we'll also support a separate domain like archive-sys.foolz.us/api/ to be able to have boards with reserved names like /admin/ (not that we need it, but who knows if someone wants to make a board called /api/). Make sure you can support separate domain on 4chan X.

nstepien commented 12 years ago

Make sure you can support separate domain on 4chan X.

What do you mean?

eksopl commented 12 years ago

I believe he means that you should support XHRing from archive-sys.foolz for boards whose archive is located at archive.foolz.

nstepien commented 12 years ago

See

Archivers would need to allow CORS from boards.4chan.org of course.

eksopl commented 12 years ago

Nah, I don't think he's thinking that far ahead, he just means that you'll need some kind of map in the script to keep the html => "archive.foolz.us", json => "archive-sys.foolz.us" mapping, so you don't request the wrong domain. You know, trivial stuff.
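
A minimal sketch of such a map, assuming one entry per archive. The two foolz hostnames come from this thread; the table layout and the name `host_for` are hypothetical.

```python
# Hypothetical lookup table: which host serves HTML pages and which serves JSON.
ARCHIVE_HOSTS = {
    "archive.foolz.us": {
        "html": "archive.foolz.us",
        "json": "archive-sys.foolz.us",
    },
}


def host_for(archive: str, kind: str) -> str:
    """Return the host to request for kind 'html' or 'json'."""
    hosts = ARCHIVE_HOSTS.get(archive)
    if hosts is None:
        # Assumption: archives without a separate API domain use one host for both.
        return archive
    return hosts[kind]
```

With this, a JSON request for a board archived at archive.foolz.us goes to `host_for("archive.foolz.us", "json")`, while links shown to the user keep the HTML host.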

What's wrong with GM_xmlhttpRequest, by the way?

nstepien commented 12 years ago

You know, trivial stuff.

Okay.

What's wrong with GM_xmlhttpRequest, by the way?

Not portable. It only works for Scriptish/GM, maybe Scriptify.

eksopl commented 12 years ago

Chrome's native userscript handler supports it. Blank Canvas Script Handler and Tampermonkey also support it.

Only Opera doesn't, and Opera doesn't support CORS either.

I mean, sure, it's non-standard, but everywhere where CORS would work, GM_xmlhttpRequest also does. Other than Safari, maybe? Does Safari even support userscripts?

I'll put the CORS header in the API because it's the right thing to do, but as a practical approach, you don't lose anything by using the GM_xmlhttpRequest method.

nstepien commented 12 years ago

Opera doesn't support CORS

Opera 12 will support CORS, Opera Mobile 12 is already out as stable and does support it. http://caniuse.com/cors

Does Safari even support userscripts?

see
http://blog.neozeed.net/4chan-x-for-safari
http://archive.rebeccablacktech.com/4klaani/g/?task=search&search_text=ninjakit
I don't know about GM_xmlhttpRequest support though.

I'll put the CORS header in the API because it's the right thing to do

Good.

but as a practical approach, you don't lose anything by using the GM_xmlhttpRequest method.

The method to use GM_xmlhttpRequest is different than normal XMLHttpRequest, and I don't want to maintain different code for different userscript implementations.

nstepien commented 12 years ago

Before I forget about it, this is what the JSON'd object should contain, at least for 4chan X's use:

thread id: int [1]
post id: int [1? we don't necessarily need it if we always have the post id to begin with.]

name: string [1]
trip: string [0,1]
user id: string [0,1]
mail: string [0,1]

time: string? [1] (4chan localized time, 4chan time format)
comment: string [1] (directly as the 4chan's HTML would be? It needs to work with spoilers, moot tags, /tg/ dice rolls, /p/'s exif data (if these are archived), etc...)

img: object [0,1]
  real filename: string [1]
  shorter real filename: string [0,1? maybe construct it on 4chan X's side]
  4chan filename: string [1]
  thumbnail src: string [1] (archived one)
  full src: string or boolean [0,1] (I can construct it from the 4chan filename, but I need to know whether it is archived or not.)
  spoilered: boolean [1]
  dimension: string [1]
  filesize: string [1]
  md5: string [1]
  thumbnail height: int [1]
  thumbnail width: int [1]

Hopefully I didn't miss anything.
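
The wishlist above can be sketched as typed records, reading [1] as required and [0,1] as optional. All field names below are hypothetical snake_case renderings of the list; they are not the keys any archive actually returns.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Image:
    real_filename: str
    chan_filename: str            # "4chan filename" in the list above
    thumbnail_src: str            # archived thumbnail
    spoilered: bool
    dimension: str
    filesize: str
    md5: str
    thumbnail_height: int
    thumbnail_width: int
    short_real_filename: Optional[str] = None   # could be built client-side
    full_src: Union[str, bool, None] = None     # URL, or a flag for "archived or not"


@dataclass
class Post:
    thread_id: int
    post_id: int
    name: str
    time: str                     # 4chan-localized time string, per the list
    comment: str
    trip: Optional[str] = None
    user_id: Optional[str] = None
    mail: Optional[str] = None
    img: Optional[Image] = None   # absent for text-only posts
```

A text-only post then only needs the five required fields, e.g. `Post(thread_id=1, post_id=2, name="Anonymous", time="...", comment="hi")`.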

woxxy commented 12 years ago

I guess we'll stick with returning the objects with the database names.

Here's an example from a thread. Ignore the formatted part, it's internal, so I'll make it so that's opt-in. I'll keep the _processed fields there since some of the processing might only be possible on the server. They might not be useful for you though.

http://archive.foolz.us/api/chan/thread/board/a/num/63272253/format/xml
http://archive.foolz.us/api/chan/thread/board/a/num/63272253/format/json

I'll get you a single-post function soon. It will be /api/chan/post/board/a/num/63272253_12 where the _12 is for ghost posts; if it's omitted it means 0, which means it's not a ghost post.
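
A small sketch of how a client might split such a post number, assuming the `num_subnum` form described above. The function names are hypothetical.

```python
from typing import Tuple


def parse_post_num(value: str) -> Tuple[int, int]:
    """Split '63272253_12' into (63272253, 12); a missing suffix means subnum 0."""
    num, sep, subnum = value.partition("_")
    return int(num), (int(subnum) if sep else 0)


def is_ghost(value: str) -> bool:
    """A post is a ghost post when its subnum is non-zero."""
    return parse_post_num(value)[1] != 0
```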

You will have to deal with BBC tags and parsing backlinks to actual links. Fuuka and FFuuka will surely return the same BBC.

https://github.com/eksopl/fuuka/blob/master/Board/Yotsuba.pm#L405-429
or
https://github.com/eksopl/asagi/blob/master/src/main/java/net/easymodo/asagi/Yotsuba.java#L122-146

Whichever reads easier.

woxxy commented 12 years ago

Here we go, why not just add the function now.

http://archive.foolz.us/api/chan/post/board/a/num/63272253/format/xml
http://archive.foolz.us/api/chan/post/board/a/num/63272253/format/json

You will get an error field in case there's any kind of error and it will contain a human readable explanation. 404 on every kind of error currently since all it's possible is:

nstepien commented 12 years ago

Good good. I'll wait for eksopl's opinion on it though, I wouldn't want to maintain different implementations.

Why are there no imageboard standards working groups yet?

woxxy commented 12 years ago

Which origins do I have to allow?

nstepien commented 12 years ago

boards.4chan.org. Dunno if you have to specify the protocols (http and https).
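
On the protocol question: an origin is the scheme plus host (plus port), so the http and https variants count as two different origins, and Access-Control-Allow-Origin carries either one origin or `*`. A minimal server-side sketch, assuming the server echoes back whichever allowed origin made the request; the function name `cors_headers` is hypothetical.

```python
from typing import Dict, Optional

# Both schemes are listed because each one is a distinct origin.
ALLOWED_ORIGINS = {"http://boards.4chan.org", "https://boards.4chan.org"}


def cors_headers(request_origin: Optional[str]) -> Dict[str, str]:
    """Return the CORS response headers for the given Origin request header."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            "Access-Control-Allow-Origin": request_origin,
            # Responses differ per origin, so caches must key on it.
            "Vary": "Origin",
        }
    # Unknown or missing origin: send no CORS headers at all.
    return {}
```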

woxxy commented 12 years ago

Done.

fit-bear commented 12 years ago

Does this mean we can see deleted posts and ghost posts fetched by woxxy on 4chan? That's so cool!

woxxy commented 12 years ago

No it doesn't. To do that we would need a much more powerful server for the archives. This only fetches single posts when you hover on backlinks.

eksopl commented 12 years ago

I'm not entirely sure if there's a point of exposing the doc_id. I can't see any bad implications from that, though.

I'd also prefer returning the time as an int in UTC. Performing a proper EST -> UTC is easy enough to do server-side, but it's not so straightforward to do it on clients (as @woxxy found out when the US changed to DST, wwww), since it involves stuff like tz databases. It is NOT just +5, you will need to use a library of some kind to perform that conversion properly, so it's much easier for the server to do it.
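
A sketch of that server-side conversion, assuming stored times are naive US Eastern wall-clock values. It uses Python's `zoneinfo` (3.9+, which needs a tz database on the system) purely as an illustration; it is not how Fuuka itself does it, and `eastern_to_unix` is a hypothetical name.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")


def eastern_to_unix(naive_local: datetime) -> int:
    """Treat a naive datetime as US Eastern wall-clock time; return UTC epoch seconds."""
    # The tz database picks -5 (EST) or -4 (EDT) depending on the date.
    # Ambiguous times in the autumn changeover hour resolve to the first occurrence.
    return int(naive_local.replace(tzinfo=EASTERN).timestamp())
```

Noon on 15 January 2012 maps to 17:00 UTC, while noon on 15 July 2012 maps to 16:00 UTC, which is why a fixed +5 is wrong for half the year.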

I am okay with the other simple fields as they seem to be taken straight out of the database.

thumbnail_href and image_href are useful and make sense. Personally, I'd prefer something like thumbnail_link and image_link, though. Or to be consistent with fuuka internals, it'd probably be thumb_link and media_link.

remote_image_href and safe_media_hash also make sense. Again, I'd prefer remote_media_link, but it's very much a non-issue.

I can't really support _processed fields on my end, as fuuka does all of its sanitizing on database insert (other than making media_hash URL-safe). I'm okay with defining formatted as how the post's HTML would be generated in the archive site, so XMLHttpRequest backlinking implementations can use this. Requests will need a theme parameter for foolfuuka, though, won't they?

woxxy commented 12 years ago

They're changed to _link versions on dev of FFuuka. Not sure when it will go live.

nstepien commented 12 years ago

@woxxy Mind filling me with updates on this? Has anything changed? I don't feel like waiting anymore for fuuka's implementation.

eksopl commented 12 years ago

I totally forgot about this, but sure, I'll follow FF's specification. If something about it ends up hurting my sensibilities too much, we're coordinated enough that we can both change, so.

eksopl commented 12 years ago

By the way, if you ever require fast answers from either the FF guys or me and you don't mind using IRC, #fooldriver at irchighway is probably the fastest way.

nstepien commented 12 years ago

All right~

oohnoitz commented 12 years ago

@MayhemYDG The only change made to the API is the renaming of the column/key names. Below is what some of these columns/keys hold, to avoid any confusion. The rest of the columns/keys should be self-explanatory.

preview_orig - this is the 4chan filename
media_orig - this is the 4chan filename
media_filename - this is the filename of the image as uploaded by the user
safe_media_hash - this is the media hash used in many of our links to avoid the need to URL-encode

nstepien commented 12 years ago

@oohnoitz

preview_orig - this is the 4chan filename
media_orig - this is the 4chan filename

So I assume preview_orig is the 4chan filename displayed on board pages and media_orig is the actual 4chan filename?

@eksopl May I remind you about issue #23?

eksopl commented 12 years ago

preview_orig is the original thumbnail filename on 4chan, media_orig the image filename.

preview_orig = 1678456348s.jpg
media_orig = 1678456348.png
media_filename = smugsion.png

It's called media rather than image because in theory 4chan can host stuff like PDFs.

Also, that one is still pending on writing a proper DB+image migration script for the scheme FF uses now (no duplicated images). Next three weeks I'm going to be basically gone, so ETA end of this month.

oohnoitz commented 12 years ago

If you need to load the thumbnail with the quote preview, you should use the value for thumb_link in the API. Furthermore, media_link will only return a value if it exists on the server.
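
A minimal sketch of reading those two keys from a decoded API post. It assumes a missing full image may show up as an absent key, null, or an empty string; the thread does not say which, and `preview_sources` is a hypothetical name.

```python
from typing import Optional, Tuple


def preview_sources(post: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (thumbnail URL, full image URL); either is None when unavailable."""
    # "or None" folds absent keys, null and "" into a single None value.
    thumb = post.get("thumb_link") or None
    full = post.get("media_link") or None
    return thumb, full
```

A quote preview can then show the thumbnail and only link to the full image when the second value is not None.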

nstepien commented 12 years ago

How do you guys handle >quotes from the raw text into spans? How the dumper converts html into text and stores it doesn't help me on that one.

woxxy commented 12 years ago

@MayhemYDG https://github.com/FoOlRulez/FoOlFuuka/blob/master/application/models/post_model.php#L889-891

Basically:

$find = "'(\r?\n|^)(&gt;.*?)(?=$|\r?\n)'i";
$html = '\\1<span class="greentext">\\2</span>\\3';
$comment = preg_replace($find, $html, $comment);
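
A rough Python port of the snippet above, offered as a sketch rather than FoOlFuuka's actual code. It assumes the comment is already HTML-escaped (so `>` appears as `&gt;`). The `\3` reference is dropped because the pattern only has two capturing groups (the lookahead does not capture), and Python rejects references to groups that do not exist.

```python
import re

# Group 1: start of comment or a line break; group 2: a line starting with "&gt;".
# The lookahead leaves the following line break unconsumed, so the next
# quoted line can match it as its own group 1.
GREENTEXT = re.compile(r"(\r?\n|^)(&gt;.*?)(?=$|\r?\n)", re.IGNORECASE)


def greentext(comment: str) -> str:
    """Wrap each line that starts with '&gt;' in a greentext span."""
    return GREENTEXT.sub(r'\1<span class="greentext">\2</span>', comment)
```

A `&gt;` in the middle of a line is left alone by this pattern, so only lines that begin with the quote marker are wrapped.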

nstepien commented 12 years ago

That doesn't look like it would work with hurr[spoiler][/spoiler]>durr. Do you do it before or after reHTMLing?

nstepien commented 12 years ago

I also need a clarification: title is for textContent and title_processed is for innerHTML, right?

woxxy commented 12 years ago

Our regex makes the greentext work only if the line starts with a >. But then, I just tried on 4chan. Is that sequence a trick to greentext? If that's the case, I guess we should have it as well at least for the archive boards.

title_processed is title, but with invisible characters removed, RTL characters removed and HTML entities converted (htmlentities) to prevent display tricks and HTML injection.

Both of them (all of the string variables from users) are passed under iconv with //IGNORE flag to strip characters that aren't UTF-8.
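
An approximate Python sketch of the two steps described above, under stated assumptions: decoding with `errors="ignore"` stands in for iconv's //IGNORE, Unicode categories Cf and Cc stand in for "invisible" and directional control characters, and `html.escape` only covers `&`, `<`, `>` and quotes, which is narrower than PHP's htmlentities. The function names are hypothetical.

```python
import html
import unicodedata


def strip_invalid_utf8(raw: bytes) -> str:
    """Drop byte sequences that are not valid UTF-8 instead of failing."""
    return raw.decode("utf-8", errors="ignore")


def process_title(title: str) -> str:
    """Remove format/control characters, then escape HTML special characters."""
    # Cf covers zero-width characters and RTL/LTR marks and overrides;
    # Cc covers ASCII control characters.
    visible = "".join(
        ch for ch in title if unicodedata.category(ch) not in ("Cf", "Cc")
    )
    return html.escape(visible, quote=True)
```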

nstepien commented 12 years ago

Is that sequence a trick to greentext?

Yes, you newfag.

woxxy commented 12 years ago

;_;

nstepien commented 12 years ago

The only possible "capcode" values are "N", "A", and "M", right? There really needs to be a wiki page in one of you guys' repository for explanations. Especially when you change things around like subject -> title.

woxxy commented 12 years ago

N is a normal user, A is Admin, M is mod, G is global mod.

About documentation, I'll write some more later. I've just started covering developers' part.

nstepien commented 12 years ago

G is global mod

explain further

eksopl commented 12 years ago

G is God. It's a joke. It puts little flowers all around your name in Fuuka, it's pretty much unused.

woxxy commented 12 years ago

It's an original Fuuka specification, but judging by your reaction I guess you've never seen it happen either. From the table creation SQL:

 capcode enum('N', 'M', 'A', 'G') NOT NULL DEFAULT 'N',

@oohnoitz did not include G in the search system so my guess is that we don't have a G in the database. I am running a query right now to make sure we don't have something from the past decade, but it will take a while since I am not using SphinxSearch.

You can completely ignore G. If the query returns an entry, I'll let you know.

woxxy commented 12 years ago

@eksopl ...what. Ah well. Shutting down the query.

And getting rid of the Global Mod lines in the theme too I guess.

nstepien commented 12 years ago

I see I see.

eksopl commented 12 years ago

If there are any G entries, they'll be ghost posts only. You can safely treat G as N in FoolFuuka/4chan X. It's more of an easter egg than anything else.
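
A client-side lookup following this advice could be as small as the sketch below. The four capcode letters come from the table definition quoted earlier; the role labels and the name `capcode_role` are hypothetical.

```python
# 'G' is folded into the normal-user role, as suggested above.
CAPCODE_ROLES = {
    "N": "user",
    "G": "user",
    "M": "mod",
    "A": "admin",
}


def capcode_role(capcode: str) -> str:
    """Map a capcode letter to a display role; unknown letters raise KeyError."""
    return CAPCODE_ROLES[capcode]
```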

nstepien commented 12 years ago

It's pretty much usable now, feel free to try it. I haven't done the mobile user info stuff yet. The date text when not formatted by 4chan X is not formatted like 4chan yet. I'm not sure how I'll handle spoilered images.

nstepien commented 12 years ago

As for handling comment tags and greentext, I went with, in order:

That should match 4chan's HTML correctly.

eksopl commented 12 years ago

Nice. I moved the original issue this was about over to GH-54 and renamed this one.

I might have bug tracking OCD.