amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons
0 stars 0 forks source link

[Sitemaps] Add Tika MediaType Support #40

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Change the Mime type parsing to use Tika's MediaType.

So instead of this code:
if (url.getPath().endsWith(".xml") || contentType.contains("text/xml") || 
contentType.contains("application/xml") || 
contentType.contains("application/x-xml")
                        || contentType.contains("application/atom+xml") || contentType.contains("application/rss+xml")) {

            // Try parsing the XML which could be in a number of formats
            return processXml(url, content);
        } else if (url.getPath().endsWith(".txt") || contentType.contains("text/plain")) {
            // plain text
            return (AbstractSiteMap) processText(content, url.toString());
        } else if (url.getPath().endsWith(".gz") || contentType.contains("application/gzip") || contentType.contains("application/x-gzip") || contentType.contains("application/x-gunzip")
                        || contentType.contains("application/gzipped") || contentType.contains("application/gzip-compressed") || contentType.contains("application/x-compress")
                        || contentType.contains("gzip/document") || contentType.contains("application/octet-stream")) {
            return processGzip(url, content);
        }

I want to Identify the mediaType:
MediaType mediaType = MediaType.parse(contentType);

And then to process as follows:
1. By recursing through the mediatype supertypes till we get to the root and 
compare to the XML media type (or others)
2. If not found we should check the Aliases (for example text/xml is an alias 
of application/xml which is the more accurate form)
3. If not found then it is a bad MediaType and the exception should be thrown.

In this issue I will remove the recognition of unknown-to-tika media types 
(Although, I will submit them to the Tika Jira and we might automatically 
identify them in the future with future Tika library updates ).

This Issue will actually change the media recognition from our own String 
recognition to Tika recognition of MediaTypes

Original issue reported on code.google.com by avrah...@gmail.com on 26 Apr 2014 at 7:52

GoogleCodeExporter commented 8 years ago
I have submitted all types to the TIKA jira and all of them have been approved 
which means that if we implement this issue, we won't lose any Media-Type.

More than that though, by implementing Tika's detection we will support any 
future MediaTypes.

Original comment by avrah...@gmail.com on 21 May 2014 at 4:37

GoogleCodeExporter commented 8 years ago
Note: Don't forget to uncomment the relevant unit tests (the 2 tests are in 
@Ignore state)

Original comment by avrah...@gmail.com on 22 May 2014 at 6:56

GoogleCodeExporter commented 8 years ago
I have a slight concern re adding a dependency on Tika, as that can be pretty 
"heavy" in terms of jars pulled in. Do you know how many new dependencies we 
get?

Original comment by kkrugler...@transpac.com on 24 Jun 2014 at 3:04

GoogleCodeExporter commented 8 years ago
The reason I used Tika is because it already exists in our project, I am not 
sure how the class loader works, but I thought that it already loaded the jars 
at this point (I would be glad if you could explain this point).

The dependency is only on Tika-core which is the lightweight version of 
Tika-as-dependency

And our logic needs Tika - as it is really essential here. You can see that I 
added two tests (bug #42) that fail our current code but should succeed with 
Tika for example.

I suggest that if performance becomes an issue we will address it (and hardcode 
the MIME signatures for example)

Original comment by avrah...@gmail.com on 24 Jun 2014 at 4:33

GoogleCodeExporter commented 8 years ago
This is the Maven repo page for tika-core:
http://mvnrepository.com/artifact/org.apache.tika/tika-core/1.5

It depends on 4 different artifacts (one of them is JUnit which we already 
have), which have no follow-ups dependencies (they don't depend on any other 
library, well, except for JUnit which has one dependency, but we already have 
it in our project).

So these are the external dependencies which Tika will add:

Group           Artifact            Version
org.osgi    org.osgi.core           4.0.0
org.osgi    org.osgi.compendium 4.0.0
biz.aQute   bndlib                  1.43.0

Original comment by avrah...@gmail.com on 24 Jun 2014 at 10:50

GoogleCodeExporter commented 8 years ago
Is there a patch for this issue? Or is the patch the code posted on your 
initial comment e.g remove a bunch and add one line for 
MediaType.parse(contentType)? 
Thanks for this issue.
@Ken, I do not think that the tika-core import is blockingas Avi has mentioned 
it is a minimum import. Me can even consider limiting the three additional 
dependencies as the MediaType code is self contined within tika-core/detect or 
/mime module IIRC.

Original comment by lewis.mc...@gmail.com on 1 Jul 2014 at 2:32

GoogleCodeExporter commented 8 years ago
I didn't submit this patch yet as it was changing the Parser file which was 
changed by several open issues.

I have the fixed code in my IDE and will submit the patch in the near future 
(preferably, after the submission of issue39 which touches the same file)

Original comment by avrah...@gmail.com on 1 Jul 2014 at 4:41

GoogleCodeExporter commented 8 years ago
Will submit the patch after review of issue39 (touches the same file)

Original comment by avrah...@gmail.com on 7 Jul 2014 at 1:59

GoogleCodeExporter commented 8 years ago
Updated the Meta for this issue.

Original comment by avrah...@gmail.com on 8 Jul 2014 at 5:01

GoogleCodeExporter commented 8 years ago
Fixed and added the patch.

Please note that the fix is contained within SitemapParser
I have mentioned the TODO I left there as issue45

I have uncommented 2 @ignore unit tests which now that we have Tike support - 
they pass.

I have also updated a small method (about priority) in SitemapURL, where I 
updated the log severity and granularity.

Original comment by avrah...@gmail.com on 11 Jul 2014 at 8:16

Attachments:

GoogleCodeExporter commented 8 years ago
Tested locally and in example code I have... not problems from me.
+1... can you please wait until someone else has tried this out before we 
commit?
Thanks Avi. Good job.

Original comment by lewis.mc...@gmail.com on 11 Jul 2014 at 2:06

GoogleCodeExporter commented 8 years ago
Of course!

Original comment by avrah...@gmail.com on 11 Jul 2014 at 2:20

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 12 Jul 2014 at 8:21

GoogleCodeExporter commented 8 years ago
From what I can tell, with this patch SiteMapParser is not thread-safe. I don't 
know if it was previously, just an observation that this might be a regression.

Also, in SiteMapURL it's better to explicitly test for a null priorityStr, 
versus relying on handling the NullPointerException. 

Original comment by kkrugler...@transpac.com on 14 Jul 2014 at 11:58

GoogleCodeExporter commented 8 years ago
Synchronized the initializing method for null safety.

Checking for null (or empty) instead of relying on NPE for the code flow.

The Patch is attached

Original comment by avrah...@gmail.com on 15 Jul 2014 at 7:27

Attachments:

GoogleCodeExporter commented 8 years ago
A few things : 

Let's not break the compatibility of the API with the change to the signature 
for the method processText : 

-    private SiteMap processText(byte[] content, String sitemapUrl) throws 
IOException {
-
+    private SiteMap processText(String sitemapUrl, byte[] content) throws 
IOException {

It does not add anything to revert the order of the arguments.

The following object 

private AutoDetectParser autoDetectParser;

is not used outside the method initializeMediaTypeComponents(), it does not 
need to be a field

What about using static fields for 

+    private MediaTypeRegistry mediaTypeRegistry;
+    private final List<MediaType> XML_MEDIA_TYPES = new ArrayList<MediaType>();
+    final List<MediaType> TEXT_MEDIA_TYPES = new ArrayList<MediaType>();
+    final List<MediaType> GZ_MEDIA_TYPES = new ArrayList<MediaType>();

and have a static initializer instead of initializeMediaTypeComponents()? This 
way we'd be sure it's been done only once - not only for the instance but for 
the whole class.

Thanks

Original comment by digitalpebble on 16 Jul 2014 at 11:09

GoogleCodeExporter commented 8 years ago
My Comments are inline

**********
We have 3 process methods (text, xml, gz) the other two have the byte array
as their -second- argument.
So because it is a private method with only one caller, I changed it to
have also the byte array as the second argument - nobody else should use it
(private), so I don't think it will cause any trouble.
**********

Just to understand what you are suggesting:
1. Have those Lists - static
2. Have the AutoDetectParser - static
3. Call the initialization of the above in the constructor
4. Remove the "synchronized" from the initializing method

Correct ?
**********

Original comment by avrah...@gmail.com on 16 Jul 2014 at 11:32

GoogleCodeExporter commented 8 years ago
Hi,

I hadn't noticed it was private, the change you suggested is fine.

1. Have those Lists - static => yes
2. Have the AutoDetectParser - static => no, it does not need to be exposed as 
a field at all
3. Call the initialization of the above in the constructor => no. We don't call 
it, we' do the whole thing in a static initializer block 
(http://docs.oracle.com/javase/tutorial/java/javaOO/initial.html)
4. Remove the "synchronized" from the initializing method => well there won't 
be an initializing method, it will be part of the static initializer

Original comment by digitalpebble on 16 Jul 2014 at 11:52

GoogleCodeExporter commented 8 years ago
Thank you Julien, I will look into it tonight.

Avi.

Original comment by avrah...@gmail.com on 16 Jul 2014 at 12:00

GoogleCodeExporter commented 8 years ago
Removed the AutoDetectParser
Staticized the lists and MediaTypeRegistry - and initialized them staticly

Original comment by avrah...@gmail.com on 17 Jul 2014 at 6:51

Attachments:

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Jul 2014 at 8:05

GoogleCodeExporter commented 8 years ago
Hi,

Why don't you put the content of initMediaTypes() in the static initializer 
block? 

String contentType = new Tika().detect(onlineSitemapUrl);
=> detecting the mime type on the URL only is probably not very reliable. Use 
the method detect(InputStream stream, String name) instead which takes both the 
binary content and the file name 

src/main/java/crawlercommons/sitemaps/SiteMapURL.java => are these changes 
relevant for this issue? If not let's track these separately

Thanks

Original comment by digitalpebble on 21 Jul 2014 at 7:47

GoogleCodeExporter commented 8 years ago
--Why don't you put the content of initMediaTypes() in the static
initializer block?--

I have followed your link:
http://docs.oracle.com/javase/tutorial/java/javaOO/initial.html

Where I found that having a private static methods is the same as the
static init blocks, and it looks cleaner to me to put the one time
initialization separately at the bottom of the class:
[From the Oracle documentation from your link]: "There is an alternative to
static blocks — you can write a private static method: ..."

--String contentType = new Tika().detect(onlineSitemapUrl);
=> detecting the mime type on the URL only is probably not very reliable.
Use the method detect(InputStream stream, String name) instead which takes
both the binary content and the file name--

This one is taken care of in issue47, please note comment #5 there.
I did change that line though because it sometimes gives wrong detection to
use the byte array as the input, when sending the URL as input, the
detection is good (like sending a stream and filename) but the performance
is bad, the performance issue is fixed in issue47.

--src/main/java/crawlercommons/sitemaps/SiteMapURL.java => are these
changes relevant for this issue? If not let's track these separately--

I admit that these changes aren't relevant to this issue, but they are
hardly worth of a separate issue, I can open a new issue though if you
think it is worth it.
In that file I put a slight upgrade to the logging.

Original comment by avrah...@gmail.com on 21 Jul 2014 at 8:13

GoogleCodeExporter commented 8 years ago
Let's instanciate the Tika instance only once and reuse it - otherwise we have 
to reload the Tika config everytime which is definitely not needed.

Original comment by digitalpebble on 1 Aug 2014 at 3:17

GoogleCodeExporter commented 8 years ago
Done.

Patch Committed

Original comment by avrah...@gmail.com on 6 Aug 2014 at 7:09

GoogleCodeExporter commented 8 years ago
Hi Avi - one thing that's useful (not that I'm good about this) is to include 
the SVN revision number or Git SHA1 in the comment, so that it's easier to 
trace back.

Original comment by kennethk...@gmail.com on 6 Aug 2014 at 7:55

GoogleCodeExporter commented 8 years ago
Committed at SVN Revision #133

Original comment by avrah...@gmail.com on 7 Aug 2014 at 7:44