Closed GoogleCodeExporter closed 8 years ago
I have submitted all types to the TIKA jira and all of them have been approved
which means that if we implement this issue, we won't lose any Media-Type.
More than that though, by implementing Tika's detection we will support any
future MediaTypes.
Original comment by avrah...@gmail.com
on 21 May 2014 at 4:37
Note: Don't forget to uncomment the relevant unit tests (the 2 tests are in
@Ignore state)
Original comment by avrah...@gmail.com
on 22 May 2014 at 6:56
I have a slight concern re adding a dependency on Tika, as that can be pretty
"heavy" in terms of jars pulled in. Do you know how many new dependencies we
get?
Original comment by kkrugler...@transpac.com
on 24 Jun 2014 at 3:04
The reason I used Tika is because it already exists in our project, I am not
sure how the class loader works, but I thought that it already loaded the jars
at this point (I would be glad if you could explain this point).
The dependency is only on Tika-core which is the lightweight version of
Tika-as-dependency
And our logic needs Tika - as it is really essential here. You can see that I
added two tests (bug #42) that fail our current code but should succeed with
Tika for example.
I suggest that if performance becomes an issue we will address it (and hardcode
the MIME signatures for example)
Original comment by avrah...@gmail.com
on 24 Jun 2014 at 4:33
This is the Maven repo page for tika-core:
http://mvnrepository.com/artifact/org.apache.tika/tika-core/1.5
It depends on 4 different artifacts (one of them is JUnit which we already
have), which have no follow-ups dependencies (they don't depend on any other
library, well, except for JUnit which has one dependency, but we already have
it in our project).
So these are the external dependencies which Tika will add:
Group Artifact Version
org.osgi org.osgi.core 4.0.0
org.osgi org.osgi.compendium 4.0.0
biz.aQute bndlib 1.43.0
Original comment by avrah...@gmail.com
on 24 Jun 2014 at 10:50
Is there a patch for this issue? Or is the patch the code posted on your
initial comment e.g remove a bunch and add one line for
MediaType.parse(contentType)?
Thanks for this issue.
@Ken, I do not think that the tika-core import is blockingas Avi has mentioned
it is a minimum import. Me can even consider limiting the three additional
dependencies as the MediaType code is self contined within tika-core/detect or
/mime module IIRC.
Original comment by lewis.mc...@gmail.com
on 1 Jul 2014 at 2:32
I didn't submit this patch yet as it was changing the Parser file which was
changed by several open issues.
I have the fixed code in my IDE and will submit the patch in the near future
(preferably, after the submission of issue39 which touches the same file)
Original comment by avrah...@gmail.com
on 1 Jul 2014 at 4:41
Will submit the patch after review of issue39 (touches the same file)
Original comment by avrah...@gmail.com
on 7 Jul 2014 at 1:59
Updated the Meta for this issue.
Original comment by avrah...@gmail.com
on 8 Jul 2014 at 5:01
Fixed and added the patch.
Please note that the fix is contained within SitemapParser
I have mentioned the TODO I left there as issue45
I have uncommented 2 @ignore unit tests which now that we have Tike support -
they pass.
I have also updated a small method (about priority) in SitemapURL, where I
updated the log severity and granularity.
Original comment by avrah...@gmail.com
on 11 Jul 2014 at 8:16
Attachments:
Tested locally and in example code I have... not problems from me.
+1... can you please wait until someone else has tried this out before we
commit?
Thanks Avi. Good job.
Original comment by lewis.mc...@gmail.com
on 11 Jul 2014 at 2:06
Of course!
Original comment by avrah...@gmail.com
on 11 Jul 2014 at 2:20
Original comment by avrah...@gmail.com
on 12 Jul 2014 at 8:21
From what I can tell, with this patch SiteMapParser is not thread-safe. I don't
know if it was previously, just an observation that this might be a regression.
Also, in SiteMapURL it's better to explicitly test for a null priorityStr,
versus relying on handling the NullPointerException.
Original comment by kkrugler...@transpac.com
on 14 Jul 2014 at 11:58
Synchronized the initializing method for null safety.
Checking for null (or empty) instead of relying on NPE for the code flow.
The Patch is attached
Original comment by avrah...@gmail.com
on 15 Jul 2014 at 7:27
Attachments:
A few things :
Let's not break the compatibility of the API with the change to the signature
for the method processText :
- private SiteMap processText(byte[] content, String sitemapUrl) throws
IOException {
-
+ private SiteMap processText(String sitemapUrl, byte[] content) throws
IOException {
It does not add anything to revert the order of the arguments.
The following object
private AutoDetectParser autoDetectParser;
is not used outside the method initializeMediaTypeComponents(), it does not
need to be a field
What about using static fields for
+ private MediaTypeRegistry mediaTypeRegistry;
+ private final List<MediaType> XML_MEDIA_TYPES = new ArrayList<MediaType>();
+ final List<MediaType> TEXT_MEDIA_TYPES = new ArrayList<MediaType>();
+ final List<MediaType> GZ_MEDIA_TYPES = new ArrayList<MediaType>();
and have a static initializer instead of initializeMediaTypeComponents()? This
way we'd be sure it's been done only once - not only for the instance but for
the whole class.
Thanks
Original comment by digitalpebble
on 16 Jul 2014 at 11:09
My Comments are inline
**********
We have 3 process methods (text, xml, gz) the other two have the byte array
as their -second- argument.
So because it is a private method with only one caller, I changed it to
have also the byte array as the second argument - nobody else should use it
(private), so I don't think it will cause any trouble.
**********
Just to understand what you are suggesting:
1. Have those Lists - static
2. Have the AutoDetectParser - static
3. Call the initialization of the above in the constructor
4. Remove the "synchronized" from the initializing method
Correct ?
**********
Original comment by avrah...@gmail.com
on 16 Jul 2014 at 11:32
Hi,
I hadn't noticed it was private, the change you suggested is fine.
1. Have those Lists - static => yes
2. Have the AutoDetectParser - static => no, it does not need to be exposed as
a field at all
3. Call the initialization of the above in the constructor => no. We don't call
it, we' do the whole thing in a static initializer block
(http://docs.oracle.com/javase/tutorial/java/javaOO/initial.html)
4. Remove the "synchronized" from the initializing method => well there won't
be an initializing method, it will be part of the static initializer
Original comment by digitalpebble
on 16 Jul 2014 at 11:52
Thank you Julien, I will look into it tonight.
Avi.
Original comment by avrah...@gmail.com
on 16 Jul 2014 at 12:00
Removed the AutoDetectParser
Staticized the lists and MediaTypeRegistry - and initialized them staticly
Original comment by avrah...@gmail.com
on 17 Jul 2014 at 6:51
Attachments:
Original comment by avrah...@gmail.com
on 18 Jul 2014 at 8:05
Hi,
Why don't you put the content of initMediaTypes() in the static initializer
block?
String contentType = new Tika().detect(onlineSitemapUrl);
=> detecting the mime type on the URL only is probably not very reliable. Use
the method detect(InputStream stream, String name) instead which takes both the
binary content and the file name
src/main/java/crawlercommons/sitemaps/SiteMapURL.java => are these changes
relevant for this issue? If not let's track these separately
Thanks
Original comment by digitalpebble
on 21 Jul 2014 at 7:47
--Why don't you put the content of initMediaTypes() in the static
initializer block?--
I have followed your link:
http://docs.oracle.com/javase/tutorial/java/javaOO/initial.html
Where I found that having a private static methods is the same as the
static init blocks, and it looks cleaner to me to put the one time
initialization separately at the bottom of the class:
[From the Oracle documentation from your link]: "There is an alternative to
static blocks — you can write a private static method: ..."
--String contentType = new Tika().detect(onlineSitemapUrl);
=> detecting the mime type on the URL only is probably not very reliable.
Use the method detect(InputStream stream, String name) instead which takes
both the binary content and the file name--
This one is taken care of in issue47, please note comment #5 there.
I did change that line though because it sometimes gives wrong detection to
use the byte array as the input, when sending the URL as input, the
detection is good (like sending a stream and filename) but the performance
is bad, the performance issue is fixed in issue47.
--src/main/java/crawlercommons/sitemaps/SiteMapURL.java => are these
changes relevant for this issue? If not let's track these separately--
I admit that these changes aren't relevant to this issue, but they are
hardly worth of a separate issue, I can open a new issue though if you
think it is worth it.
In that file I put a slight upgrade to the logging.
Original comment by avrah...@gmail.com
on 21 Jul 2014 at 8:13
Let's instanciate the Tika instance only once and reuse it - otherwise we have
to reload the Tika config everytime which is definitely not needed.
Original comment by digitalpebble
on 1 Aug 2014 at 3:17
Done.
Patch Committed
Original comment by avrah...@gmail.com
on 6 Aug 2014 at 7:09
Hi Avi - one thing that's useful (not that I'm good about this) is to include
the SVN revision number or Git SHA1 in the comment, so that it's easier to
trace back.
Original comment by kennethk...@gmail.com
on 6 Aug 2014 at 7:55
Committed at SVN Revision #133
Original comment by avrah...@gmail.com
on 7 Aug 2014 at 7:44
Original issue reported on code.google.com by
avrah...@gmail.com
on 26 Apr 2014 at 7:52