Genentech / gReLU

gReLU is a python library to train, interpret, and apply deep learning models to DNA sequences.
https://genentech.github.io/gReLU/
MIT License

wrong input file pointed by the tutorial 3_train.ipynb #24

Open mhfzsharmin opened 1 month ago

mhfzsharmin commented 1 month ago

Is 3_train.ipynb pointing to the wrong input file? The following call fails at the bedtools step even though the bedtools command exists:

bw_file = grelu.data.preprocess.make_insertion_bigwig(
    frag_file = frag_file,
    plus_shift=0,
    minus_shift=1, # This corrects the +4/-5 Tn5 shift to a +4/-4 shift
    genome=genome,
    chroms="autosomes", # The output bigWig file contains coverage over autosomes.
)

I looked into the input file, and its chromosome names start with GL instead of chr.

--------- file content ---------

GL000009.2  835 1045    SRR11442505 0   +
GL000009.2  1533    1725    SRR11442505 0   +
GL000009.2  3315    3453    SRR11442499 0   +
GL000009.2  3678    3746    SRR11442506 0   +
GL000009.2  4983    5061    SRR11442498 0   +
GL000009.2  4986    5059    SRR11442499 0   +
GL000009.2  5163    5263    SRR11442505 0   +
GL000009.2  7728    7874    SRR11442501 0   +
GL000009.2  7785    7893    SRR11442502 0   +
GL000009.2  9991    10173   SRR11442501 0   +

------------- error log---------------

Making bedgraph file
cat /gstore/home/sharmim1/artifacts/fragment_file:v0/Microglia_full.bed | awk -v OFS="\t" '{print $1,$2+0,$3,1000,0,"+";
    print $1,$2,$3+1,1000,0,"-"}' | sort -k1,1 | grep -e ^chr1 -e ^chr2 -e ^chr3 -e ^chr4 -e ^chr5 -e ^chr6 -e ^chr7 -e ^chr8 -e ^chr9 -e ^chr10 -e ^chr11 -e ^chr12 -e ^chr13 -e ^chr14 -e ^chr15 -e ^chr16 -e ^chr17 -e ^chr18 -e ^chr19 -e ^chr20 -e ^chr21 -e ^chr22  | bedtools genomecov -bg -5 -i stdin -g /gstore/home/sharmim1/.local/share/genomes/hg38/hg38.fa.sizes | bedtools sort -i stdin > ./Microglia_full.bedGraph
/bin/sh: line 1: bedtools: command not found
/bin/sh: line 1: bedtools: command not found
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In[14], line 1
----> 1 bw_file = grelu.data.preprocess.make_insertion_bigwig(
      2     frag_file = frag_file,
      3     plus_shift=0,
      4     minus_shift=1, # This corrects the +4/-5 Tn5 shift to a +4/-4 shift
      5     genome=genome,
      6     chroms="autosomes", # The output bigWig file contains coverage over autosomes.
      7 )

File ~/.conda/envs/grelu/lib/python3.11/site-packages/grelu/data/preprocess.py:697, in make_insertion_bigwig(frag_file, genome, out_prefix, plus_shift, minus_shift, chroms, tmp_dir, out_dir)
    695 cmd = open_cmd + shift_cmd + filter_cmd + bedgraph_cmd + sort_cmd
    696 print(cmd)
--> 697 subprocess.run(cmd, shell=True, check=True)
    699 # bedgraph file -> bigWig file
    700 print("Making bigWig file")

File ~/.conda/envs/grelu/lib/python3.11/subprocess.py:569, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    567     retcode = process.poll()
    568     if check and retcode:
--> 569         raise CalledProcessError(retcode, process.args,
    570                                  output=stdout, stderr=stderr)
    571 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command 'cat /gstore/home/sharmim1/artifacts/fragment_file:v0/Microglia_full.bed | awk -v OFS="\t" '{print $1,$2+0,$3,1000,0,"+";
    print $1,$2,$3+1,1000,0,"-"}' | sort -k1,1 | grep -e ^chr1 -e ^chr2 -e ^chr3 -e ^chr4 -e ^chr5 -e ^chr6 -e ^chr7 -e ^chr8 -e ^chr9 -e ^chr10 -e ^chr11 -e ^chr12 -e ^chr13 -e ^chr14 -e ^chr15 -e ^chr16 -e ^chr17 -e ^chr18 -e ^chr19 -e ^chr20 -e ^chr21 -e ^chr22  | bedtools genomecov -bg -5 -i stdin -g /gstore/home/sharmim1/.local/share/genomes/hg38/hg38.fa.sizes | bedtools sort -i stdin > ./Microglia_full.bedGraph' returned non-zero exit status 127.
avantikalal commented 1 month ago

Hi @mhfzsharmin, judging from the error log, this doesn't seem to be a gReLU issue. This error can occur if bedtools is not installed, is not on your PATH, or has permissions that are too restrictive. Could you check?
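
A quick way to run that check from inside the notebook is sketched below. It uses only the Python standard library; the helper names `find_tool` and `shell_sees_tool` are made up for illustration and are not part of gReLU. The second helper is there because the traceback shows the command being run through `subprocess.run(cmd, shell=True)`, so what matters is whether the shell started by the notebook kernel can resolve bedtools, which may differ from an interactive terminal. It assumes a POSIX shell.

```python
import shlex
import shutil
import subprocess
from typing import Optional


def find_tool(name: str = "bedtools") -> Optional[str]:
    """Return the path to `name` as resolved on this process's PATH, or None."""
    return shutil.which(name)


def shell_sees_tool(name: str = "bedtools") -> bool:
    """Return True if a shell launched from this process can resolve `name`."""
    # `command -v` is the POSIX way to ask the shell where a command lives.
    result = subprocess.run(
        f"command -v {shlex.quote(name)}",
        shell=True,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Run these in the same notebook kernel that raised the error.
print("Python sees bedtools at:", find_tool("bedtools"))
print("Shell can resolve bedtools:", shell_sees_tool("bedtools"))
```

If these report `None` or `False` in the notebook while bedtools works in a terminal, one possible explanation is that the kernel was started with a different PATH than the terminal. That is a guess about the setup, not something the log confirms.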

mhfzsharmin commented 1 month ago

@avantikalal The file exists in my path. I can also access bedtools from the same virtual environment I am using. I gave the file content above. The lines in the file start with GL, not with chr, which is what the grep command is looking for. Since the grep command cannot find any chr* rows, it generates an empty file, and this empty file causes the bedtools error.

maggieeiggam commented 1 month ago

@mhfzsharmin I wonder if there's an error with the chroms="autosomes" parameter because of the trailing comma, which would explain why it's returning unlocalized contigs/scaffolds.

bw_file = grelu.data.preprocess.make_insertion_bigwig(
    frag_file = frag_file,
    plus_shift=0,
    minus_shift=1, # This corrects the +4/-5 Tn5 shift to a +4/-4 shift
    genome=genome,
    chroms="autosomes", # The output bigWig file contains coverage over autosomes.
)

avantikalal commented 1 month ago

@mhfzsharmin I just ran tutorial 3 and did not observe this error. The fragment file is not the issue: it does contain chr* chromosome names below the rows starting with GL. You will see them if you view the tail of the file instead of the head. The rows starting with GL are simply ignored.

I still think this is likely an issue with the bedtools installation.
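
The point about the head versus the tail of the fragment file can be checked directly by counting rows per chromosome name. The following is a minimal sketch, not gReLU code: `count_chroms` and `autosome_rows` are hypothetical helpers, and the two `chr` rows in the sample are invented for illustration (only the `GL000009.2` rows come from the file content posted above).

```python
from collections import Counter

# Autosome names as they appear in the grep filter: chr1 ... chr22.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def count_chroms(lines):
    """Count rows per chromosome name (the first whitespace-separated column)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines and comment lines.
        if fields and not fields[0].startswith("#"):
            counts[fields[0]] += 1
    return counts


def autosome_rows(counts):
    """Total number of rows whose chromosome name is exactly chr1 ... chr22."""
    return sum(n for chrom, n in counts.items() if chrom in AUTOSOMES)


sample_rows = [
    "GL000009.2\t835\t1045\tSRR11442505\t0\t+",
    "GL000009.2\t1533\t1725\tSRR11442505\t0\t+",
    "chr1\t10000\t10150\tEXAMPLE\t0\t+",   # invented row, for illustration only
    "chr22\t20000\t20120\tEXAMPLE\t0\t+",  # invented row, for illustration only
]

counts = count_chroms(sample_rows)
print(counts)
print("autosome rows:", autosome_rows(counts))
```

For the real file, assuming it is plain text as the `cat` command in the log suggests, the same function can be applied with `with open(frag_file) as fh: counts = count_chroms(fh)`. A non-zero autosome count would mean the grep filter has rows to pass on to bedtools.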