Attaching and Normalising Problems in NVivo12

gdzqzzx commented 3 years ago

Hi Johnathan @jschultz ,

First of all, thank you so much for developing this fantastic tool to free data from NVivo.

I am currently trying to extract data from NVivo12 (Windows) for my team. Though nvivotools hasn't supported NVivo12 yet, it seems that current mssqlAttach.bat still works in NVivo12 based on #9 . It's not bothering for me to query MS SQL Server directly by python or any other db management tools.

So here is my dummy question: How to attach npv file to SQL Server manually?

My feeble programming skills can only make me identify the key points are in mssqkTools.py and mssqlAttach.bat, but I'm not sure how exactly nvp file is attached to MS SQL Server. I have tried to attach nvp file directly to MS SQL Server, and got following error:

Msg 5120, Level 16, State 101, Server XXXXXXX\SQLEXPRESS, Line 1 Unable to open the physical file "XXXX.nvp". Operating system error 5: "5(Access is denied.)". Msg 1802, Level 16, State 7, Server XXXXXXX\SQLEXPRESS, Line 1 CREATE DATABASE failed. Some file names listed could not be created. Check related errors.

Do I need to convert the nvp file into some other format before attaching it to MS SQL Server?

jschultz commented 3 years ago

Glad you like it @gdzqzzx !

In answer to your questions, attaching the NVivo file to MS SQL Server shouldn't be difficult and you don't need to convert it into another format since .nvp files are in fact just renamed .mdf files. Although you could access the file directly through SQL Server, you would have to deal with NVivo's opaque data structures and miss out on all the hard work I did reverse engineering them!

Based on the error message it seems the SQL Server is unable to access the NVivo file. If you look at mssqlAttach.bat you'll see that it doesn't attach the file directly but makes a temporary copy of it into the directory %TMP%\SQL and (in theory) enables full access for all users to that temporary copy. Are you able to locate the file there and check the relevant permissions as discussed on this page: https://stackoverflow.com/questions/18286765/sql-server-operating-system-error-5-5access-is-denied?

Let me know how it goes and I'll try to work out a fix to the scripts.

gdzqzzx commented 3 years ago

Thanks for your quick response! I will try it and get back to you soon.

gdzqzzx commented 3 years ago

After a whole day testing, I finally attached my nvp file to MS SQL Server 2017.

For those who also want to attach by themselves, here are my experience:

Install MS SQL Server 2017 (Basic) + Management Studio (18);
Copy nvp file to a temporary folder that all users can access. Create a new folder >> Right click the folder >> Click properties >> Click security >> just give FULL CONTROL to all the users/groups you can see!! (Dangerous but convenient)
Run management studio as administrator, then you should be able to ATTACH the nvp file to SQL Server.

Note1: If you still can't attach your nvp, it might be still because of authentication problem, add a new group including all users under current account, and give this group full control permission.

Note2: If you can attach your nvp, but the data is shown not found, try to open this file in your NVivo once. Then you can find that the detected database name changes and data can be found.

Note3: I am not sure why I can't start my SQL Server Service after I changed TCP port to 1433. So I have to keep using the dynamic port 2851, which can be connected successfully to my DBeaver.

gdzqzzx commented 3 years ago

Another annoying issue is that I got is grabled columns in dbo.Item and dbo.Project table. It seems that attributes in dbo.Item are all important, it might be something relate to encoding because I am using a Chinese system, and I still have no idea to fix it.

jschultz commented 3 years ago

Those issues with file and directory permissions should be handled by the helper scripts in NVivotools. It would be good to work out why they aren't.

As for the mangled Name and Description fields, this is fairly simple. NVivo just offsets each character but some fixed value. I forget what it was but if you look in the NVivotools code you'll find it.

But what you probably want to do now is to use NVivo tools to process your NVivo file using the ~~DenormaliseNVP.py~~NormaliseNVP.py script.

gdzqzzx commented 3 years ago

I have tried to use NVivotools directly. But first failed in SQL Server setting and then not sure what parameter I should set. I will try it again and see how it goes.

jschultz commented 3 years ago

@gdzqzzx Did you mean to close the issue?

gdzqzzx commented 3 years ago

Yes, because the problem has been solved and actually it's not that relevant to NVivo improvement.

I think it might be good to open another issue if I have any other problems when using NVivotools -- of course, feel free to keep it opened if you want.

jschultz commented 3 years ago

Did you work out why the helper scripts didn't fix the file access problems?

gdzqzzx commented 3 years ago

Yes, but not all of it, to some extent.

What I fixed 2 days ago is attahing nvp file to SQL Server manually without using NVivotools. Before that, I tried mssqlAttach.py as python mssqlAttach.py -P 2851 -i EXPRESS "xxx.nvp" ntool_test and got following error:

Traceback (most recent call last):
  File "mssqlAttach.py", line 54, in <module>
    mssqlAttach(None)
  File "mssqlAttach.py", line 49, in mssqlAttach
    verbosity = args.verbosity)
  File "...\mssqlTools.py", line 74, in __init__
    raise RuntimeError('No suitable SQL self.server self.instance found')
RuntimeError: No suitable SQL self.server self.instance found

And I just tried running mssqlAttach.bat directly as mssqlAttach.bat "xxx.nvp" ntool_test SQLEXPRESS. It works fine this time.

But some import error occured when I ran DenormaliseDB.py, I will open a new issue for that.

gdzqzzx commented 3 years ago

I have tested mount function in NVivo.py. It seems that the instance variable somehow wasn't passed correctly into mssqlTools initialization. (Of course, the SQL Server version has been fixed as 'MSSQL14')

I got RuntimeError when I used NVivo.mount(), while mssqlAPI.attach() worked fine separately.

@jschultz Please let me know if you want to do more testing on this issue.

(NormaliseDB.py hasn't worked for me yet, but it seems that it's because I haven't set up the output DB properly. I will test it later.)

jschultz commented 3 years ago

I see. The way that the code can end up producing that error message is if the break command two lines earlier does not execute. It looks like instanceversion and version both contain the value MSSQL14 but maybe there is some reason why the test for equality if (not version or instanceversion == version): is failing. Perhaps you could investigate what is happening there?

I also note that your line numbers differ from mine by one, ie the RuntimeError is on line 74 for me. Is it possible you have inserted a line somewhere that is somehow messing things up?

Regarding NormaliseDB.py, while I appreciate your testing all this code, it should still be simpler for you to use NormaliseNVP.py which handles all the database mounting etc. and produces a SQLite output file.

gdzqzzx commented 3 years ago

I remember I switched to NormaliseDB.py because I stuck in NormaliseNVP.py, but anyway, now I made it run now and got following outpus and error (line 120 in NVivoNVP.py was edited as version = ('MSSQL12' if args.nvivoversion == '11' else 'MSSQL14')):

Using MSSQL instance: SQLEXPRESS
Using port: 2851
Found SQL server instance SQLEXPRESS  version MSSQL14
Using MSSQL instance: SQLEXPRESS
Using port: SQLEXPRESS
Attached database nvivo18284
Normalising users
Normalising project
Normalising node categories
Normalising nodes
Normalising node attributes
Normalising source categories
Normalising sources
Traceback (most recent call last):
  File "...\sqlalchemy\sql\base.py", line 1156, in __getattr__
    return self._index[key]
KeyError: 'ThumbnailLocation'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "NormaliseNVP.py", line 142, in <module>
    NormaliseNVP(None)
  File "NormaliseNVP.py", line 131, in NormaliseNVP
    NVivo.Normalise(args)
  File "...\NVivo.py", line 664, in Normalise
    nvivoSource.c.ThumbnailLocation if args.nvivoversion == '12' else nvivoSource.c.Thumbnail,
  File "...\sqlalchemy\sql\base.py", line 1158, in __getattr__
    util.raise_(AttributeError(key), replace_context=err)
  File "...\sqlalchemy\util\compat.py", line 211, in raise_
    raise exception
AttributeError: ThumbnailLocation

Everything works properly before normalizing Source table. It seems that NVivo12 still uses Tumbnail and Waveform as column names. It may worth double-checking the column names in NVivo12 with this file.

Moreover, it might be good to add NVivo12 into nvivoversion choice in all scripts.

jschultz commented 3 years ago

This is the kind of thing that keeps on coming up. At some point I must have found that NVivo version 12 used ThumbnailLocation and WaveformLocation, but clearly you are finding otherwise. Those changes were back in 2018 so maybe a more recent update of NVivo changed the column names. Or maybe it is different between Mac and Windows versions. Because NVivo is so opaque you never really know what is going on.

I'd say the best solution would be to test for the presence of the relevant columns rather than use the NVivo version to decide which one to use. If you are able to do this that would be great but otherwise you can just hack the column names to make it work for you.

And yes I agree completely that NVivo12 support should be included throughout. Including denormalisation, ie converting into NVivo format. I just need to find the time, or someone else to do it. :)

gdzqzzx commented 3 years ago

I think detecting the relevant columns is a good idea. I will try if I can make it after I make your original script work.

I will keep updating my progress in this issue.

gdzqzzx commented 3 years ago

I am happy to announce that I have successfully normalized the nvp file with NormaliseNVP.py.

Besides the conflict on ThumbnailLocation and WaveformLocation, I found that there is a bug in NVivo.py. @jschultz

From line 904 to line 910, there should be parentheses for IF ELSE clause as follow:

(
    [nvivoAnnotation.c.StartText, nvivoAnnotation.c.LengthText]
    if args.mac
    else [nvivoAnnotation.c.StartX, nvivoAnnotation.c.LengthX]
)

Otherwise the ELSE condition will ignore the former columns and return StartXand LengthXonly.

The last thing I want to confirm is whether the condition nvivoNodeReference.c.StartZ.is_(None) of taggings extraction is still correct. I happened to found that

nvivodb.execute(
    select(
        [
            nvivoNodeReference.c.Id,
            nvivoNodeReference.c.Source_Item_Id.label("Source"),
            nvivoNodeReference.c.Node_Item_Id.label("Node"),
            nvivoNodeReference.c.StartX,
            nvivoNodeReference.c.LengthX,
            nvivoNodeReference.c.StartY,
            nvivoNodeReference.c.LengthY,
            nvivoNodeReference.c.StartZ,
            nvivoNodeReference.c.CreatedBy,
            nvivoNodeReference.c.CreatedDate,
            nvivoNodeReference.c.ModifiedBy,
            nvivoNodeReference.c.ModifiedDate,
        ]
        + (
            [nvivoNodeReference.c.StartText, nvivoNodeReference.c.LengthText]
            if args.mac
            else []
        )
    ).where(
        and_(
            # nvivoNodeReference.c.ReferenceTypeId == literal_column('0'),
            nvivoItem.c.Id == nvivoNodeReference.c.Node_Item_Id,
            nvivoItem.c.TypeId == literal_column(NVivo.ItemType.Node),
            nvivoNodeReference.c.StartZ.is_(None),
        )
    )
)

returned nothing, but if I remove nvivoNodeReference.c.StartZ.is_(None) the query have some returns and all of them have StartZ = 8. Maybe there's no taggings data in my nvp file, but I am still curious what is this NodeReference.StartZ for？

jschultz commented 3 years ago

Fantastic! I'm rather busy at the moment but will endeavour to look more closely at these issues at some point. Are you able to make a pull request for the bug fix? And perhaps check that the same mistake (ie missing parentheses) doesn't appear at other points in the code? As for what StartZ means I really have no idea. Presumably some kind of three dimensional data. Are you able in some way to work out what those NodeReference records with StartZ equal to 8 correspond to in your data? Or if you have a data file that doesn't contain confidential information could you share it?

gdzqzzx commented 3 years ago

Yes, I am happy to check the potential bug in other files later. As for the sample data file, it's a pity that I haven't have any desensitization data yet, I will try to figure it out when I have time.

My initial thought is extracting all the codes with corresponding source texts into a plain file like json or csv. For exmple,

{"Source": "text A", "Codes": ["code A", "Code B"]}

I presumed that I need Source table to implement this idea, but it seems that the number of row in Source table doesn't have any relationship with the original coding files. Also, I found that the decompressed results of Thumbnail and Object columns in Source table is still not readable.

Does those two columns need further processing for decoding? Or maybe I don't need these columns but need some other tables to implement my data extraction?

jschultz commented 3 years ago

The Thumbnail column is just copied straight from the NVivo column of the same name. I would presume it's some kind of image format (.png, .jpg?) but I don't really know much about it. The Object column is in a format determined by the column ObjectType. The possible values are DOC,ODT, MP3, JPEG and PDF. But if what you are after is a textual representation of the source content then look in the Content column. Producing output as you describe should be pretty simple using Python scripts, or even just SQL.

gdzqzzx commented 3 years ago

Yes, that's right. I was confused by the values in Nodes and thought maybe I miss something, but now I find that the norm file should contain all the things I need.

Thanks for your explanation :)

gdzqzzx commented 3 years ago

I am happy to announce that I have successfully normalized the nvp file with NormaliseNVP.py.

Besides the conflict on ThumbnailLocation and WaveformLocation, I found that there is a bug in NVivo.py. @jschultz

From line 904 to line 910, there should be parentheses for IF ELSE clause as follow:
(
    [nvivoAnnotation.c.StartText, nvivoAnnotation.c.LengthText]
    if args.mac
    else [nvivoAnnotation.c.StartX, nvivoAnnotation.c.LengthX]
)
Otherwise the ELSE condition will ignore the former columns and return StartXand LengthXonly.

The last thing I want to confirm is whether the condition nvivoNodeReference.c.StartZ.is_(None) of taggings extraction is still correct. I happened to found that
nvivodb.execute(
    select(
        [
            nvivoNodeReference.c.Id,
            nvivoNodeReference.c.Source_Item_Id.label("Source"),
            nvivoNodeReference.c.Node_Item_Id.label("Node"),
            nvivoNodeReference.c.StartX,
            nvivoNodeReference.c.LengthX,
            nvivoNodeReference.c.StartY,
            nvivoNodeReference.c.LengthY,
            nvivoNodeReference.c.StartZ,
            nvivoNodeReference.c.CreatedBy,
            nvivoNodeReference.c.CreatedDate,
            nvivoNodeReference.c.ModifiedBy,
            nvivoNodeReference.c.ModifiedDate,
        ]
        + (
            [nvivoNodeReference.c.StartText, nvivoNodeReference.c.LengthText]
            if args.mac
            else []
        )
    ).where(
        and_(
            # nvivoNodeReference.c.ReferenceTypeId == literal_column('0'),
            nvivoItem.c.Id == nvivoNodeReference.c.Node_Item_Id,
            nvivoItem.c.TypeId == literal_column(NVivo.ItemType.Node),
            nvivoNodeReference.c.StartZ.is_(None),
        )
    )
)
returned nothing, but if I remove nvivoNodeReference.c.StartZ.is_(None) the query have some returns and all of them have StartZ = 8. Maybe there's no taggings data in my nvp file, but I am still curious what is this NodeReference.StartZ for？

Update on NodeReference.StartZ issue: In my nvp file, I found that the StartZ in Annotation and NodeReference(Tagging) are all equal to 8. Regarding there is no StartZ filtering in Annotation querying, I presume that nvivoNodeReference.c.StartZ.is_(None) in NodeReference(Tagging) should be removed, otherwise the most important coding information would be filtered (in my version).
I also found that for Tagging.Fragment, StartY is 1-based in NVivo file, while StartX is 0-based, which means when normalizing my version of nvp file, item['Fragment'] += ',' + str(item['StartY']+1) should be item['Fragment'] += ',' + str(item['StartY']) in def build_tagging_or_annotation(item) (NVivo.py)

I am using NVivo Windows 12.6.0.959.

jschultz commented 3 years ago

Thanks @gdzqzzx and sorry I don't have much time right now but I would like to incorporate your work when I get a chance. Is there any chance you could make one or more pull requests reflecting the issues you have found. If you're unsure how to do that there are instructions here. If you don't have the time could you at least commit the changes you have made to your fork of the repo so I can see them all? Thanks!

gdzqzzx commented 3 years ago

Yes, I am happy I am able to improve this fabulous tool! I will try to create my pull requests later this week (sorry I can't do it instantly because I don't have enough time currently and it would take me some time to figure out how to do the pulling process).

gdzqzzx commented 3 years ago

Hi @jschultz sorry for the delay, I have made the pull request ready for review (hope I did it correctly). Please let me know if there's any other things I can help.

BarraQDA / nvivotools

Attaching and Normalising Problems in NVivo12 #10