Contribute additional scrapers

domxch commented 8 years ago

How do I contribute by adding additional scrapers? Is there any guidance regarding the objects to be created?

DoctorD1501 commented 8 years ago

Hey Domxch,

To add a new scraper, create a new class in the moviescraper.doctord.controller.siteparsingprofile.specific package that extends SiteParsingProfile and implements the SpecificProfile interface. You will need to implement methods such as scrapeTitle() for each of the fields you want to populate. Scraping is usually done with the jsoup library; you can look up its documentation or browse the scraper classes already written for examples of how to use it.

When creating the scraper class there are two methods that need additional explanation:

createSearchString - builds the URL of the GET request you want to submit to the site. Usually the site exposes some sort of search endpoint that accepts a GET request: you take part of the file name and pass it in as a parameter. Look at some of the other scrapers to see how this works.

getSearchResults - once you are on the search results page, you need to return a SearchResult object for each applicable movie item on the page. Returning multiple objects gives the user a choice of which search result to use, but technically you only need to return one object in the array. This method usually involves using jsoup selectors to find the HTML tag you want and then constructing the object from that tag's properties. You can optionally pass an image URL to the SearchResult constructor to show a small preview when the user has chosen to pick the search result manually while scraping.
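As a rough illustration of the createSearchString idea, here is a minimal, self-contained sketch that uses only the standard library. The endpoint https://example.com/search?q= and the class name SearchStringSketch are made up for illustration; a real scraper would implement createSearchString(File) in its profile class and use the site's actual search URL. It assumes Java 10 or later for the URLEncoder overload that takes a Charset.

```java
import java.io.File;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SearchStringSketch {

    // Hypothetical search endpoint; substitute the real site's search URL.
    static final String SEARCH_URL = "https://example.com/search?q=";

    // Builds the GET request URL from the movie file's name with the extension removed.
    static String createSearchString(File file) {
        String name = file.getName();
        int lastDot = name.lastIndexOf('.');
        if (lastDot > 0) {
            name = name.substring(0, lastDot);
        }
        // URL-encode the query so spaces and symbols survive in the request
        return SEARCH_URL + URLEncoder.encode(name, StandardCharsets.UTF_8);
    }
}
```

Calling SearchStringSketch.createSearchString(new File("My Movie 123.mp4")) yields a URL whose query is the encoded file name without its extension. Sites that address each movie page directly by ID can skip the search step and return the page URL itself.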

To get the scraper to show up in the menu, simply placing a class of the right type in the moviescraper.doctord.controller.siteparsingprofile.specific package should be enough, since I use some methods to get a list of all classes in the package and create a menu option for each one. If you want to provide an icon, put it in the res/sites folder. The file name needs to be the name of the class you wrote minus the words "ParsingProfile", so a class called JavLibraryParsingProfile should have an icon called JavLibrary.png. Take a look at the files already in there to get an idea of the dimensions.
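The icon naming rule can be checked with a small helper like the one below. This is only a sketch: IconNameSketch and iconPathFor are hypothetical names, not part of JAVMovieScraper, and the .png extension is assumed from the JavLibrary.png example above.

```java
class IconNameSketch {

    // Suffix that is dropped from the class name to get the icon name
    static final String SUFFIX = "ParsingProfile";

    // Returns the expected icon path under res/sites for a scraper class name.
    static String iconPathFor(String simpleClassName) {
        String siteName = simpleClassName.endsWith(SUFFIX)
                ? simpleClassName.substring(0, simpleClassName.length() - SUFFIX.length())
                : simpleClassName;
        return "res/sites/" + siteName + ".png";
    }
}
```

For example, iconPathFor("JavLibraryParsingProfile") gives res/sites/JavLibrary.png, which matches the naming described above.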

On the off chance that your site has a JSON API, take a look at the "TheMovieDatabaseParsingProfile" class for an example of how to implement a scraper for such sites, as it's a bit different from writing a traditional scraper with jsoup.

Lastly, if you wish to make your new scraper available from the command line, take a look at updating Main.java. You'll want to add it to returnParsingProfileFromCommandLineOption and also update the help message in main() so people know they can use it.

Just let me know if you have any other specific questions as you go and I can help you out.

-DoctorD

domxch commented 8 years ago

Here is a scraper for lesshin. It was super easy to create. I'm new to GitHub, so I'm posting it here so I don't mess anything up. Besides, my version of JAVMovieScraper is now a heavily bastardised version with a connection to a database for filtering movies, some basic URL caching, etc. If you have any improvements to suggest, let me know, but this works so far on my modest selection of titles.

package moviescraper.doctord.controller.siteparsingprofile.specific;

import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.FilenameUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import moviescraper.doctord.controller.languagetranslation.Language;
import moviescraper.doctord.controller.siteparsingprofile.SiteParsingProfile;
import moviescraper.doctord.model.SearchResult;
import moviescraper.doctord.model.dataitem.Actor;
import moviescraper.doctord.model.dataitem.Director;
import moviescraper.doctord.model.dataitem.Genre;
import moviescraper.doctord.model.dataitem.ID;
import moviescraper.doctord.model.dataitem.MPAARating;
import moviescraper.doctord.model.dataitem.OriginalTitle;
import moviescraper.doctord.model.dataitem.Outline;
import moviescraper.doctord.model.dataitem.Plot;
import moviescraper.doctord.model.dataitem.Rating;
import moviescraper.doctord.model.dataitem.ReleaseDate;
import moviescraper.doctord.model.dataitem.Runtime;
import moviescraper.doctord.model.dataitem.Set;
import moviescraper.doctord.model.dataitem.SortTitle;
import moviescraper.doctord.model.dataitem.Studio;
import moviescraper.doctord.model.dataitem.Tagline;
import moviescraper.doctord.model.dataitem.Thumb;
import moviescraper.doctord.model.dataitem.Title;
import moviescraper.doctord.model.dataitem.Top250;
import moviescraper.doctord.model.dataitem.Trailer;
import moviescraper.doctord.model.dataitem.Votes;
import moviescraper.doctord.model.dataitem.Year;

public class LesshinParsingProfile extends SiteParsingProfile implements SpecificProfile {

private String englishPage;
private String japanesePage;

Document japaneseDocument;

@Override
public String getParserName() {
    return "LESSHIN";
}

public LesshinParsingProfile()
{
    super();
}

@Override
public Title scrapeTitle() {

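    Element titleElement = document.select("div.title_text > div").first();
    // first() returns null when the selector matches nothing on the page
    if (titleElement == null)
        return new Title("");
    return new Title(titleElement.text());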
}

@Override
public OriginalTitle scrapeOriginalTitle() {
    if(scrapingLanguage == Language.JAPANESE)
        return new OriginalTitle(scrapeTitle().getTitle());
    else
    {
        Document originalDocument = document;
        document = japaneseDocument;
        OriginalTitle originalTitle = new OriginalTitle(scrapeTitle().getTitle());
        document = originalDocument;
        return originalTitle;
    }
}

@Override
public SortTitle scrapeSortTitle() {
    return SortTitle.BLANK_SORTTITLE;
}

@Override
public Set scrapeSet() {
    return Set.BLANK_SET;
}

@Override
public Rating scrapeRating() {
    return Rating.BLANK_RATING;

}

@Override
public Year scrapeYear() {
    return scrapeReleaseDate().getYear();
}

@Override
public ReleaseDate scrapeReleaseDate()
{
    String releaseDate = "";
    // Each row of the information table is a label cell followed by a value cell
    Elements movieInfo = document.select("table#information_table > tbody > tr");
    for (Element row : movieInfo){
        if (row.children().size() > 1 && row.child(0).text().equals("Date")){
            releaseDate = row.child(1).text().trim();
        }
    }

    // Anything of four characters or fewer cannot be a full date
    if(releaseDate.length() > 4)
    {
        return new ReleaseDate(releaseDate);
    }
    return ReleaseDate.BLANK_RELEASEDATE;
}

@Override
public Top250 scrapeTop250() {
    return Top250.BLANK_TOP250;
}

@Override
public Votes scrapeVotes() {
    return Votes.BLANK_VOTES;
}

@Override
public Outline scrapeOutline() {
    return Outline.BLANK_OUTLINE;
}

@Override
public Plot scrapePlot() {
    return Plot.BLANK_PLOT;
}

@Override
public Tagline scrapeTagline() {
    return Tagline.BLANK_TAGLINE;
}

@Override
public Runtime scrapeRuntime() {

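    String duration = "";
    // Each row of the information table is a label cell followed by a value cell
    Elements movieInfo = document.select("table#information_table > tbody > tr");
    for (Element row : movieInfo){
        if (row.children().size() > 1 && row.child(0).text().equals("Duration")){
            duration = row.child(1).text();
        }
    }
    if(!duration.equals(""))
    {
        try {
            // Strip the "min" suffix and normalize the number (e.g. "045" becomes "45")
            return new Runtime(String.valueOf(Integer.parseInt(duration.replace("min", "").trim())));
        } catch (NumberFormatException e) {
            // The cell did not hold a plain number of minutes
            return Runtime.BLANK_RUNTIME;
        }
    }
    return Runtime.BLANK_RUNTIME;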
}

@Override
public Thumb[] scrapePosters() {
    ArrayList<Thumb> thumbList = new ArrayList<Thumb>();
    // The content URLs use the numeric part of the ID, without the "n" prefix
    String scrapedId = scrapeID().getId();
    try {
        String posterURL = "http://www.lesshin.com/contents/" + scrapedId.replace("n", "") + "/thum2.jpg";
        if(SiteParsingProfile.fileExistsAtURL(posterURL))
        {
            thumbList.add(new Thumb(posterURL));
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    }
    return thumbList.toArray(new Thumb[thumbList.size()]);
}

@Override
public Trailer scrapeTrailer(){
    String scrapedId = scrapeID().getId();
    String trailerURL = "http://www.lesshin.com/contents/" + scrapedId.replace("n", "") + "/sample.flv";
    if(SiteParsingProfile.fileExistsAtURL(trailerURL))
        return new Trailer(trailerURL);
    return Trailer.BLANK_TRAILER;
}

@Override
public Thumb[] scrapeFanart() {
    ArrayList<Thumb> thumbList = new ArrayList<Thumb>();
    // The content URLs use the numeric part of the ID, without the "n" prefix
    String scrapedId = scrapeID().getId();
    try {
        // Gallery images are numbered 1.jpg to 21.jpg; keep only the ones that exist
        for(int i = 1; i <= 21; i++)
        {
            String galleryImageURL = "http://www.lesshin.com/contents/" + scrapedId.replace("n", "") + "/" + i + ".jpg";
            if(SiteParsingProfile.fileExistsAtURL(galleryImageURL))
            {
                thumbList.add(new Thumb(galleryImageURL));
            }
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    }
    return thumbList.toArray(new Thumb[thumbList.size()]);
}

@Override
public Thumb[] scrapeExtraFanart() {
    return new Thumb[]{};
}

@Override
public MPAARating scrapeMPAA() {
    return MPAARating.RATING_XXX;
}

@Override
public ID scrapeID() {
    //Just get the ID from the page URL by doing some string manipulation
    String baseUri = document.baseUri();
    if(baseUri.length() > 0 && baseUri.contains("lesshin.com"))
    {
        baseUri = baseUri.replaceFirst("/index.html", "");
        String idFromBaseUri = baseUri.substring(baseUri.lastIndexOf('/')+1);
        return new ID(idFromBaseUri);
    }
    return ID.BLANK_ID;
}

@Override
public ArrayList<Genre> scrapeGenres() {
    ArrayList<Genre> genreList = new ArrayList<Genre>();

    return genreList;
}

@Override
public ArrayList<Actor> scrapeActors() {

    ArrayList<Actor> actorList = new ArrayList<Actor>();
    Elements movieInfo = document.select("table#information_table > tbody > tr");
    for (Element row : movieInfo){
        if (row.children().size() > 1 && row.child(0).text().equals("Model")){
            // Each child element of the value cell is one performer
            for (Element actor : row.child(1).children()){
                actorList.add(new Actor(actor.text(), "", null));
            }
        }
    }

    return actorList;
}

@Override
public ArrayList<Director> scrapeDirectors() {
    return new ArrayList<Director>();
}

@Override
public Studio scrapeStudio() {
    // The site is its own studio, so this is a fixed value
    return new Studio("LESSHIN");
}

@Override
public String createSearchString(File file) {
    scrapedMovieFile = file;
    String fileID = findIDTagFromFile(file);

    // findIDTag returns null when the file name contains no ID, so check before calling methods on it
    if (fileID != null) {
        fileID = fileID.toLowerCase();

        englishPage = "http://en.lesshin.com/moviepages/" + fileID + "/index.html";
        japanesePage = "http://www.lesshin.com/moviepages/" + fileID + "/index.html";
        try {
            // Always fetch the Japanese page as well; scrapeOriginalTitle() reads the title from it
            japaneseDocument = Jsoup.connect(japanesePage).timeout(SiteParsingProfile.CONNECTION_TIMEOUT_VALUE).get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        if(scrapingLanguage == Language.ENGLISH)
        {
            return englishPage;
        }
        else
        {
            return japanesePage;
        }
    }

    return null;
}

@Override
public SearchResult[] getSearchResults(String searchString)
        throws IOException {
    SearchResult searchResult = new SearchResult(searchString);
    SearchResult[] searchResultArray = {searchResult};
    return searchResultArray;
}

public static String findIDTagFromFile(File file) {
    return findIDTag(FilenameUtils.getName(file.getName()));
}

public static String findIDTag(String fileName) {
    Pattern pattern = Pattern.compile("(?:\\b|_)(n?[0-9]{3})(?:\\b|_)");
    Matcher matcher = pattern.matcher(fileName);
    if (matcher.find()) {
        String searchString = matcher.group(1);
        if (!searchString.startsWith("n")) searchString = "n" + searchString;
        return searchString;
    }
    return null;
}

@Override
public SiteParsingProfile newInstance() {
    return new LesshinParsingProfile();
}

}
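
The file-name matching in findIDTag above can be tried in isolation. The sketch below reuses the same regular expression but returns an Optional instead of null, which is one way to keep callers from invoking methods on a missing ID. The class name IdTagSketch and the Optional return type are illustrative choices, not part of the scraper as posted.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdTagSketch {

    // Same pattern as findIDTag: three digits with an optional leading "n",
    // delimited on each side by a word boundary or an underscore.
    private static final Pattern ID_PATTERN =
            Pattern.compile("(?:\\b|_)(n?[0-9]{3})(?:\\b|_)");

    // Returns the ID normalized to start with "n", or empty if the name has no ID.
    static Optional<String> findIdTag(String fileName) {
        Matcher matcher = ID_PATTERN.matcher(fileName);
        if (!matcher.find()) {
            return Optional.empty();
        }
        String id = matcher.group(1);
        return Optional.of(id.startsWith("n") ? id : "n" + id);
    }
}
```

With this pattern, a name such as lesshin_n123_hd.mp4 yields n123, lesshin-456.avi yields n456, and a name without a three-digit group yields an empty result.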

DoctorD1501 commented 8 years ago

Any chance you can submit a pull request with just this file's changes?

DoctorD1501 / JAVMovieScraper

Contribute additional scrapers #131