ababaian / LIONS

LIONS is a bioinformatic analysis pipeline which brings together a few pieces of software and some home-brewed scripts to annotate a paired-end RNAseq library to detect TE-intiated transcripts
GNU General Public License v3.0
27 stars 13 forks source link

Issue with East LIONS #25

Open smiriyala3 opened 2 years ago

smiriyala3 commented 2 years ago

Hello Authors,

I seem to be running into the following error when running East LIONS:

Loading chromosome lengths from /users/smiriyal/data/DDX4/LIONS/resources/hg38/genome/hg38.chr.size Lengths for 455 chromosomes are loaded load /users/smiriyal/data/DDX4/LIONS/projects/GSC_NSC/HNSC1_1/tmp_uniq_repeat_coords 0 regions loaded from file Exception in thread "main" java.lang.NullPointerException at java.util.TreeMap.compare(TreeMap.java:1294) at java.util.TreeMap.put(TreeMap.java:538) at java.util.TreeSet.add(TreeSet.java:255) at src.tools.RegionsCoverageFromWigCalculator.execute(RegionsCoverageFromWigCalculator.java:255) at src.tools.RegionsCoverageFromWigCalculator.main(RegionsCoverageFromWigCalculator.java:55)

Do you have any ideas on how to resolve this?

Thanks!

ababaian commented 2 years ago

is the hg38.chr.size file empty?

smiriyala3 commented 2 years ago

Do you know where I can find this file? I couldn't find it mentioned in the user guide.

ababaian commented 2 years ago

Sorry, been a really busy last few weeks. To generate this file you use the first columns of a faidx file of your genome. This should be auto-generated in the "Create Resources" step.

See: https://github.com/ababaian/LIONS/blob/595bc6fc2a8f6d2f31ce121d63a9d3a7a812af6e/scripts/Initialize/initializeRes.sh#L69

# Generate res.chr.size file
cut -f1,2 $INDEX.fa.fai > $INDEX.chr.size

Does your system have samtools installed? You may need to re-generate your resource files if it was made with an error. The auto-checks only ensure the file exists, it does not check that the file is valid.

smiriyala3 commented 2 years ago

Thanks for the response in your busy time. Really appreciate it. I believe I have hg38.chr.size file and this chr.size. file contains information (also I have samtools installed as well), but the Eastline still doesn't generate tmp_uniq_repeat_coords, as the error suggested below. I don't see this temp_uniq_repeat_coords in /users/smiriyal/data/DDX4/LIONS/projects/GSC_NSC/HNSC1_1/. Given the name, I suppose "tmp_uniq_repeat_coords" file is temporarily created, but I am not sure how eastLion.sh create uniq repeat records. Note that, in resources/hg38/repeat folder, I have rm_hg38.ucsc as well. I would appreciate if you could guide me how to address creating uniq_repeat_coords.

Loading chromosome lengths from /users/smiriyal/data/DDX4/LIONS/resources/hg38/genome/hg38.chr.size Lengths for 455 chromosomes are loaded load /users/smiriyal/data/DDX4/LIONS/projects/GSC_NSC/HNSC1_1/tmp_uniq_repeat_coords 0 regions loaded from file Exception in thread "main" java.lang.NullPointerException at java.util.TreeMap.compare(TreeMap.java:1294) at java.util.TreeMap.put(TreeMap.java:538) at java.util.TreeSet.add(TreeSet.java:255) at src.tools.RegionsCoverageFromWigCalculator.execute(RegionsCoverageFromWigCalculator.java:255) at src.tools.RegionsCoverageFromWigCalculator.main(RegionsCoverageFromWigCalculator.java:55)

ababaian commented 2 years ago

Can you list the first ~10 lines from both hg38.chr.size and for the repeat-masking file you are using? 0 regions loaded is suggesting there might be a difference in chromosome names.

Ysuita commented 2 years ago

Hi,

I'm also working on this issue with [smiriyala3]. Below you find the answer to your questions.

image image

Particularly, for hg38.chr.size, here is the rest of the file.

image

Would we need to change chromsome names in one of these files? or do we need to get a new file for one of them or both of them? I believe we got one of the files from UCSC website. Thanks for the help!

ababaian commented 2 years ago

So there was an issue where some versions of people's genomes use chr1 and other versions use 1 for chromosome naming. Can you check what is in the ./resources/<INDEX>/repeat file as well as the aligned .bam file. Are they all using the same name scheme?

Ysuita commented 2 years ago

That completely makes sense. But, I believe we have used only Emsemble for alignment, not Encode, so we should have consistent format for chromsome (ex. chr1, not 1 for chromsome).

Below is the list of files in ./resources//repeat. I was able to see head of one file (forChimericSearch), but not the rest.

image image

Here is the header of one of the bam files.

image

Is it itself the problem? the fact that the rest of the files doesn't get open by linux command? Or is some files are missing from ./resources//repeat folder?

ababaian commented 2 years ago

The issue is 0 regions loaded from file, can you post more of the output logs from the run, we'll need to isolate the exact line of code which is happening before the crash to try and run this one command in isolation

Ysuita commented 2 years ago

Here is the place that we saw the error.

image image

The thing is that as the 2nd screenshot shows, it returns that it "completes", but it emits error 15 "One of the East Lion Libraries didn't finish". Then, we went to check log and saw the error shows in the 1st screenshot, and this seems to be the 1st error in the entire script (e.g. "Run LIONS self-check procedures", "Set-up Project Workspace" didn't return errors, but at the beginning of "# ChimericReadSearch v1.0 # returned an error).

Would we need to look into java.lang.NullPointerException/RegionsCoverageFromWigCalculator? Which script potentially could produce this error?

ababaian commented 2 years ago

so it looks like eastLion.sh is getting to the point of running ChimericReadTool.sh script, and it finishes the python sub-script chimericReadSearch.py (LINE 147). This is the "heavy lifting" of the analysis, so that's the difficult part, the subsequent parts are for making summary statistics.

Is the chimericReadSearch.out file that is created empty? That is, there are no reads in your bam file which were identified at the boundaries of a TE?

The command which is failing is a few lines down in ChmiericReadTool.sh Line 237:

    $J -jar -Xmx6G $JAVA_BASE/RegionsCoverageFromWigCalculator.jar -w $fwig -r $PWD/tmp_uniq_repeat_coords -s $chrs -o $PWD -n results

so Java run of RegionsCoverageFromWigCalculator.jar


Also apologies for this hairball of code, this is what you get when a biologist is "learning to code". Reading all this back now it's not pretty.

Ysuita commented 2 years ago

Hi, based on your inputs, we have been trying different methods below.

First, we found out that chimericReadSearch.out is empty, and this seems to be caused by the failure of chimericReadSearch.py. And this failure might be due to misguide to the directory of bam file.

image

Although we specify the directory to bam file in our "input.list", argument for chimericReadSearch.py seems to pass only file name, not the pathway to the bam file (see the above screenshot). I tried changing this argument for bam file location in chimericReadSearch.py. However, this didn't work. I made sure to put index files as well. I still think that putting absolute pathway to the bam file would solve this issue, but could not identify how to fix variable for the bam file, even after I looked up ChimericReadTool.sh and eastlion.sh. It seemed that intput bam file for chimericReadSearch.py is an output of other command, which I could not identify.

Please let me know if the above guess on the pathway to the bam file seems to be reasonable for you. If so, please let me know how to fix it. If not, please let me know if there is any other solutions. Really appreciate your help!

ababaian commented 2 years ago

Ah! I have seen this before, again sorry about my clunky code. Last time the issue was that on some systems symbolic links are not working correctly. Try to find the respective $OUTPUT.bam which is also referred to as $libName.bam (so whatever you named the file on the input list).

See Line 429 of eastLion.sh

    bash $SCRIPTS/ChimericReadTool/ChimericReadTool.sh $OUTPUT.bam expression/wig/$libName.$QUALITY.wig.gz $REF
Ysuita commented 2 years ago

Hi, I was able to figure out $OUTPUT.bam (I changed it to $outDir/$OUTPUT.bam). However, ChimericReadTool.sh still seems not to work (e.g. chimericReadSearch.out file doesn't even get produced). Is it because of tmp.bed for chimericReadSearch.py? Other arguments for chimericReadSearch.py (aseembly_exons_2, forChimericSearch, bam file) seems to be fine. I checked that all these have contents, but I could not find contents for tmp.bed.

ababaian commented 2 years ago

The tmp.bed is the output of chimericReadSearch.py, which if it's not being made might be the issue, there are no reads at the intersection of your RepeatMasker / exon structures. Try and isolate the chimeriReadSearch.py command and look at the output.

Do you have some sort of positive control in this bam file which you can point to with certainty that a read/read-pair is found at the intersection of an annotated TE-Exon? My concern at the moment is that either the input data is sparse, or one of the annotations used are sparse such that no intersecting data is available.